Author: Kevin Thornton ()
libsequence is a C++ library designed to aid writing applications for genomics and evolutionary genetics. A large amount of the library is dedicated to the analysis of "single nucleotide polymorphism", or SNP data. The library is intended to be viewed as a "BioC++" akin to the bioperl project, although the scope of libsequence is limited in comparison. Much of the bioperl project concerns parsing the output of various bioinformatics programs, and the management of databases of biological data. perl is a good language for such things, and libsequence tries not to re-invent the wheel. Rather, the focus is on biological computation, such as the analysis of SNP data and sequence divergence, and the analysis of data generated from coalescent simulation.
Library features include:
- A class hierarchy to represent DNA and protein sequences, inheriting from Sequence::Seq. Various sequence formats are accomodated simply by publicly deriving from this class and defining read/write functions. Currently, Sequence::Fasta is implemented for reading Fasta sequences. Sequence classes are modeled after std::pair<std::string,std::string> to represent the name and data of a sequence object, and can be typecast to std::string.
- A set of template classes for Alignment I/O. Currently, there are classes for ClustalW and phylip format I/O.
- A class hierarchy representing polymorphism (SNP) tables, inheriting from Sequence::PolyTable. This class is extensible for I/O in the same way as Sequence::Seq. Member functions of polymorphism table classes allow the application of frequency filters, purging of missing data, etc. PolyTables can be accessed as two-dimensional arrays (i.e. x[i][j] to access the state of j-th site in the i-th individual), or by iterators, which allow algorithms to be written an applied to polymorphism tables as you would for STL containers.
- The classes Sequence::SimParams, and Sequence::SimData allow data to be read in from Dick Hudson's coalescent simulation program ms and manipulated. Sequence::SimData is a model of Sequence::PolyData, and thus has all the functionality mentioned above, providing a complete system with which to write programs to analyze simulated data sets.
- The class Sequence::PolySNP provides functions to calculate summary statistics on the data stored in polymorphism table objects. These statistics include Watterson's theta, nucleotide diversity, Tajima's D, the Fu and Li statistics, Fay and Wu's H, and several others. There are also member functions to calculate statistics related to linkage disequilibrium (LD), such as r^2, Rmin (the Hudson and Kaplan "4-gamete test" statistic), and Hudson's (1987) estimator of 4Nr, the population recombination rate.
- The class mentioned in 4 above handles missing data (untyped SNPs). This is done by adjusting the sample size of the site for each site that has missing data, then summing accross sites to calculate the statistic. This procedure is unbiased with respect to the expectation of the statistic, but there is no guarantee that this is a minimum-variance approach. If anyone ever does any theory on this issue, the implementations will be updated. In my view, this is a substantial improvement over DNAsp (the most commonly-used software for SNP analysis), which just tosses out all sites with missing data, leading to loss of information.
- As of libsequence 1.5.4, the library contains functions and data types allowing samples to be simulated under coalescent models with recombination. The algorithm is modeled after that in Dick Hudson's program "ms". This is not intended to replace ms, which does a lot more for you in terms of modeling, but rather to make custom simulations easier to write. One example is when doing approximate Bayesian inference, it's handy to be able to generate samples inside a program, instead of by system calls to ms. In the future, I will add a section to the tutorial giving an overview of this aspect of the library. The coalescent code has the following features:
- Efficiency on par with ms for models with and without recombination
- Recombination routine matches up with analytical predictions (correlation of TMRCA for 2 sites in sample size 2, and the probability that no mutations fall in either of 2 sites in sample size 2, given 4Nu and 4Nr).
- Scaling of time, etc., is up to the programmer
- Capability to simulate models with arbitrary genetic maps by providing a probability density function of per-site recombination probabilites along a sequence.
- Template-based design is independent of the random-number generator system.
- Data types "take care of themselves" in terms of copying, etc.
- There are example programs showing how to write simple neutral coalescents
- infinite-sites mutation routines can return types of class Sequence::SimData that can be analyzed immediately using other library routines
- Sequence::Comeron95, implementing the Ka/Ks (dN/dS) method of Josep Comeron, which is very similar to the classic method of Wen-Hsiung Li's ("Li 1993").
- Functions to translate coding sequences. Currently, the universal genetic code is implemented.
- A whole slew of functions, function objects, and templates that can be useful in computational genomics.
- A complete developer's reference manual, generated from the source code using doxygen.
If you use libsequence to develop applications, or use applications based on it, please cite:
Thornton, K. (2003) libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics 19(17): 2325-2327.
libsequence is written and maintained by Kevin Thornton. Dick Hudson and Jeff Wall kindly made some of their C code available, which has been adapted into the library. Eli Stahl provided a lot of useful advice during the course of several years of discussion.
History of libsequence
During my PhD work, I did a lot of bioinformatics and analysis of molecular population genetic data. Perl and bioperl got most of the bioinformatics work done, but perl scripts were too slow for population genetics, especially reading in and doing computation on data from coalescent simulations. I started writing routines in C++ for that application, which led to writing routines to handle sequence data, and then everything got way out of control and I decided to toss it all together into a library.
libsequence is developed on a GNU/Linux system using the gcc compiler. The library is known to work (i.e. compile succesfully and functions exhibit the expected behavior) on the following systems:
gcc 3.x (3.2,3.3,3.4)
|OS X (Jaguar and Panther releases)||gcc 3.x (get the latest developer's CD possible)|
|OS X (Tiger)||gcc 4 ("Xcode 2"). Note: libsequence compiles cleanly on Tiger, but I haven't done any testing. Also, gcc 4 is treated as an "unknown compiler" in boost, meaning that there may be issues with some of the code that libsequence depends on. Reports of problems are appreciated.|
I have also used the library on Sun machines running Solaris using gcc 2.95 and 3.2. It worked fine. However, I no longer have access to such a machine, so I can't make claims about the current version (but I'd wager it works).
libsequence is written in ANSI C++ and ANSI C (C89, to be particular). The STL is used heavily, as are templates. Most modern C++ compilers should be able to compile the library. There is currently (as of version 1.3.4) no use of partial template specialization, which is a feature of C++ that may not be implemented by all compilers yet (but it certainly is by the commonly-used ones). I do use template member functions (in non-template classes), which may pose a problem on some systems.
Some compilers also have problems if some of the standard C++ headers are missing. For example, versions of gcc before the 3.x series did not have <limits>, requiring the use of either <climits> or <limits.h> . As of version 1.3.4, I started using autoconf to check for these problematic headers, which solves the problem at the level of the configure scripts. I will continue this policy as problems are reported regarding the presence/absence of headers--these are easy fixes to make.
Diclaimer: I make no claims that this code compiles under Microsoft's Visual Studio. The MS implementation of C++ has been problematic in it's support of the STL. However, the code remains untested on that platform. I'd be interested in hearing of any success/failure stories.
libsequence is designed to have as few compilation dependencies as possible. Currently, the only dependency is on the boost library. Fortunately, boost is really easy to install. If your operating system doesn't have a boost installation available (Debian does, btw), just download it, and do the following (be very careful to type the commands correctly):
note: the first 3 commands will not be necessary if you already
have those directories on your system.
- sudo mkdir /usr/local/include
- sudo mkdir /usr/local/lib
- sudo mkdir /usr/local/bin
- tar xzf boost-version.tar.gz
- cd boost-version
- (wait for a while. you can ignore errors about some libraries not compiling correctly)
- sudo make install
- sudo ln -s /usr/local/include/boost-version/boost /usr/local/include/boost
libsequence is almost as easy to install as boost:
- tar xzf libsequence-version.tar.gz
- cd libsequence-version
- Note: there is a special feature for OS X users. As of libsequence 1.5.6, you can configure the library to produce cpu-specific optimized code. On a G4 powerpc using gcc (i.e. Xcode, i.e. Developer Tools), run ./configure --enable-G4=yes, and on a G5 system, ./configure --enable-G5=yes
- sudo make install
A technical point (esp. for OS X users!):
This section concerns setting up your system so that you can compile and link to libraries that don't come with your system (and hence are installed in directories other than /usr/lib). The steps described below should be taken before installing libsequence (or any Unix software from source, for that matter). However,
libsequence is a "dynamic library". This means that it is a chunk of code that sits somewhere on your computer. There are programs sitting somewhere else that need to use code in the library. This means that programs need to know where the library is. Most systems only look in /usr/lib by default, and some enlightened systems may look in /usr/local/lib. However, say you use fink or darwinports on an OS X machine to install the GSL, for example. Using fink, the GSL library will be in /usr/local/lib, and using opendarwin it will end up in /opt/local/lib. There is no guarantee then that programs needing to link to the GSL will be able to find it. On Unix-like systems, there are shell variables that fix this. They are called LD_LIBRARY_PATH and DYLD_LIBRARY_PATH (on OS X).
If you use the bash shell, add the following lines to your .bash_profile:
export LD_LIBRARY_PATH DYLD_LIBRARY_PATH
If you use a csh-like shell (the default on OS X, before 10.3), see the instructions here.
- libsequence 1.6.5 and "analysis" 0.6.9 (Nov 29, 2007). Compile fine now on Apple's "Leopard" operating system (OS X 10.5.x)
- libsequence 1.6.4 (Jun 19, 2007). Haplotype statistics for SNP data were incorrectly calculated in some cases. When alignment columns containing polymorphic sites also contain a mix of upper- and lower-case letters, the extra case was incorrectly detected as an extra haplotype. This affected the following statistics: # haplotypes, haplotype diversity, Wall's statistics, and Hudson's (1987) estimator of 4Nr.
- libsequence 1.6.3 (Jun 22, 2006). Not a bug fix, so much as a design fix. Template functions taking random number generators as type arguments are now compatible with boost::bind + gsl_rng_whatever.
- libsequence 1.6.2 (March 5, 2006). Three bugs fixed. First, in the FST module, when there are lots of populations, the sum of all the population weights was sometimes detected as not summing to 1, when they actually did. The bug was not accounting for numerical precision, which is now fixed. Second, a rare segfault in the sliding window module (PolyTableSlice.hpp) is now fixed. Third, pairwise LD statistics were being calculated incorrectly in the presence of an outgroup. Now fixed.
- libsequence 1.6.1 (Jan 6, 2006) fixes five bugs. Most of these were minor. Users of the "compute" program who used the option to calculate Watterson's theta
using only the number of variable sites are affected, and may want to re-run. (If you used the default options in the program, your output will not be affected).
- src/CoalescentInitialize.cc: Fixed bug in init_sample. Regions with 0 "sites" now have the last "site" in the chromosome labelled 0, which is correct. Previously, it was labelled 1.
- src/HKA.cc: Fixed minor error in calculation of f_hat which was pointed out by Andy Kern. I was taking the mean(f_hat) across loci as the estimate, rather than following equation 5 of the HKA paper closely enough. The effect was small, and estiamtes of theta and T_hat were unaffected or affected only very little, repsectively.
- src/PolySNP.cc: Fixed glitch in calculation of ThetaW. When totMuts==false, only the number of bi-allelic SNPs was used in the numerator of ThetaW. Now, the total number of variable sites is used, as intended.
- src/Recombination.cc (affects calculation of LD statistics): In the calculation of D', the 11 gamete was ancestral/ancestral, rather than derived/derived. This is now fixed to match the documentation.
- src/CoalescentTreeOperations.cc: in function total_time_on_arg, there was a bug leading to an exception always being thrown. This was due to an inequality being tested as < , when > was what was intended.
- libsequence 1.6.0 (Oct 15, 2005) Bug fix: when a PolySNP/PolySIM object was constructed from a SimData object that has 0 segregating sites, the value of 0 was returned for number of haplotypes. It now correctly returns 1.
- libsequence 1.5.8 (July 29, 2005) version 1.5.7 had a change in how memory allocation was handled for coalescent simulations with recombination. The change resulted in way too much memory being allocated and not recovered (a leak). This has been fixed, and the code changed back to that found in 1.5.6 and previous
- libsequence 1.5.7 (July 19, 2005) fixes a bug in Sequence::SimpleSNP (the SNP table format used for programs like Hudson's maxhap program). The bug resulted in some columns getting dropped upon input.
- libsequence 1.5.5 (Jun 9, 2005) fixes a bug in Sequence::PolySNP where preprocessing was not correctly applied. The effect of the bug was that some results would be incorrect when performed in certain orders. This affected no programs that I have written and distribute, and analysis of simulated data (using Sequence::PolySIM) was unaffected. In short, it shouldn't have hurt anybody. The non-const operator of Sequence::PolyTable now sets non_const_access to true (this fixes a minor bug, which again will rarely affect anything because most operations in the library are on const PolyTables).
- libsequence 1.5.3 (April 21, 2005) bug in PolySNP::ThetaPi() (i.e. calculation of pi for sequence data) where sites in tables where all but 1 individual had missing data were not handled correctly.
- libsequence 1.5.2 (April 8, 2005) The bugfix in 1.5.1 was incomplete. Now fixed and tested.
- libsequence 1.5.1 (April 5, 2005) Bugfix in sliding window code for non-overlapping windows along the DNA sequence.
- libsequence 1.5.0 (Jan 11, 2005). Several minor bugfixes: implicit typenames fixed so that library now compiles under gcc 3.4, a bug in the documentation of the formula for PolySNP::ThetaPi() is now corrected, <ieeefp.h> is now included in the HKA source code for systems that have that header (i.e. cygwin), and the estimate of the sample variance in <Sequence/descriptiveStats.hpp> is now unbiased (i.e uses n-1 rather than n in the denominator). In addition, there is a policy change in the calculation of several summary statistics in the classes Sequence::PolySNP and Sequence::PolySIM. When a summary statistic is undefined, the value nan (not a number) is now returned.
- libsequence 1.4.9 (Dec. 3, 2004) The HKA code had a bug in it that basically rendered the module useless (all results where undefined values). Now fixed. Also, various delcarations have been fixed to allow compilation under gcc 3.4.
- libsequence 1.4.8 (Nov 3, 2004) A major bug in counting derived sites was fixed in this release. Analyses that use an outgroup should be repeated after upgrading.
- libsequence 1.4.7 (Oct 7, 2004) Class Sequence::PolySIM now correctly sets both Wall's B and Q to -1 if there are < 2 segregating sites. Previously, only Wall's B was set correctly.
- libsequence 1.4.6 (Oct 5, 2004) Improved handling of data sets with no segregating sites for HKA calculations. Sequence::SimpleSNP now checks its input and makes sure that it's what's expected.
- libsequence 1.4.5 (Aug 22, 2004) Several small bugs fixed. The output shift operator for template class AlignStream now takes a const reference. The input routines for SimParams now reserve memory, which seems to have fixed a rare crash on OS X systems. The implementation of Spearman's Rank correlation is now much more efficient. There are also several new features added, so see the README.txt in the download directory for details. Note: this release breaks binary compatibility (but not interface compatibility).
- libsequence 1.4.4 (Jul 8, 2004) Minor bugfix release. All files using toupper now #include cctype
- libsequence 1.4.3 (Jun 4, 2004) fixes bug in the counting of singletons and derived singletons for sequence data with lots of missing data. This also affected the output for Fu & Li statistics in such cases.
- libsequence 1.4.2 (May 14, 2004) fixes the following issues: *removal of unused variables in several place *efficiency improvement in Sequence::NumDiffs() (in Comparisons.hpp) *updated Sequence::uniform (in BoostRandomNumbers.hpp) so that it compiles and is compatible with std::random_shuffle *fixed some bugs in Sequence::SimpleSNP (sample size now printed correctly upon output when there is an outgroup) *fixed bug in Recombination::Disequilibrium (frequency filter now works, case-insensitivity implemented). And, for D and D', the 11 gamete type is in terms of the minor (or derived) allele *fixed use of assert() to deal with compiler warnings NOTE: if you used the "rsq" program in my analysis package, you may want to re-run those analyses after upgrading
- libsequence 1.4.1 (April 26, 2004) fixes several important bugs. First, in the number of external mutations was calculated incorrectly in recent versions of libsequence, resulting in bad values for the Fu and Li statistics. Second, Wall's Q was incorrect (due to a typo in the original paper), and Wall's B was occasionally incorrect (when doing the calculation on polymorphism tables containing invariant ingroup sites). In addition, the class SimpleSNP no longer segfaults on input from files containing outgroup data.
- libsequence 1.4.0 (April 16, 2004) fixed a bug that caused Fu and Li's D to often return -inf.
- libsequence 1.3.9 (April 12, 2004) fixes a bug in 1.3.8 where all the codons processed during Translation of a sequence were printed to stdout. 1.3.9 is also updated to libtool 1.5.4, which hopefully won't cause any trouble :)
- libsequence 1.3.8 (released Thu, April 8, 2004) fixes 2 bugs. The first was in Sequence::PolyTable::Binary(), and resulted in the binary-format data not being assigned to the object. The second bug was in Sequence::SimData, and resulted in "ms output" being read in incorrectly when using C++-style input. The member function SimData::fromstdin(), which uses C-style input for speed, however, worked just fine.
- The program "MKtest" in versions of the analysis package up to 0.5.6 generated SNP tables incorrectly. Rather amazingly, this was done in a way that had no effect on the outcome for close to 40 data sets (!!). This is because the SNP tables are only used as guides to parse the rest of the alignment. The bug was that the positions were correct, but the states were not. But, the states were then looked up from the original alignment, resulting in correct output. The bug was still fixed for version 0.5.7.
- The Rmin (Hudson & Kaplan 1985) statistic was miscalculated when used on SNP data. The bug affected relatively few data sets (2 in my test set of 30). The reported value was 1 less than it should have been. This bug is fixed in libsequence-1.3.5. Analysis of simulated data, using Sequence::PolySIM, was unaffected by the bug. Users of the "compute" program in my analysis package may want to re-run analyses after upgrading if they uses the Rmin values reported.
Get the latest version here.
There is a reference manual, available in html, and as a pdf. There is also a tutorial, which is the most accessible introduction to the library. The reference manual is generated by the handy doxygen tool. The tutorial is written in LaTeX, and I'll make that code available in the near future.