
Reviewing FastCodeML, a potential replacement for PAML’s codeml

(first posted at SBCS-Evolve)

Valle and colleagues have just released FastCodeML, an application for the detection of positive selection in DNA datasets. It is a successor to SlimCodeML, which is itself an improved implementation of the codeml program from Ziheng Yang's workhorse phylogenetic analysis package PAML, a package that continues to be extremely popular, has moved through several versions and has at long last acquired a GUI. Note 'analysis': these programs are not intended for phylogeny inference (finding phylogenetic trees from data), although they will do a rough search if required. In fact, for many people the main program of choice for phylogeny inference itself is RAxML, which is produced by the same lab (Stamatakis). So, given this lab's pedigree, I was interested to see how FastCodeML compares to codeml, the implementation the authors seek to replace.

Installation

To install FastCodeML you'll find a set of tar'ed distributions at their FTP site, ftp://ftp.vital-it.ch/tools/FastCodeML/. It wasn't immediately clear which is the most recent build, so I downloaded both a June 2013 version and the 1.01 release just in case. The distribution includes a precompiled binary for Linux (we're on CentOS for our dev cluster) and, as I was feeling lazy (that is, keen to test the performance of the un-optimised binary that ships with the package), I used that. However, a simple configure / make / make check / make install will do the job as well. I had to fiddle with a couple of permissions, but that's all stuff unique to our dev cluster and common to everything we install, so it's hardly fair to penalise FastCodeML for that…
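For completeness, the from-source route is the usual autotools sequence; something along these lines, with the placeholders filled in for whichever release you grab from the FTP listing:

wget ftp://ftp.vital-it.ch/tools/FastCodeML/<tarball>    # pick a release from the FTP listing
tar xzf <tarball>
cd <unpacked-directory>

./configure      # pass --prefix=$HOME/local if you lack root on your cluster
make
make check
make install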

Interface

FastCodeML, like RAxML, takes its options as arguments passed on the command line at runtime, rather than through a GUI (like, say, HYPHY or Garli) or a configuration script (like PAML or BEAST). As a result (assuming you've added the FastCodeML executable, fast, to your $PATH) you can paste and go, e.g.:

fast -m 22 -bf -hy 1 ./<tree-file> <sequence-file>

I'm ambivalent about passing arguments on the command line for phylogenetic analyses, so let's look at the pros and cons. A GUI is more attractive to many users, with the important caveat that it has to be well designed; otherwise usability is degraded without any compensating gain in performance or flexibility. In any case, a lot of phylogeneticists are used to working without GUIs and, more importantly, are crunching large genomic datasets these days, so the lack of a GUI isn't really a problem.

What about config files, the approach taken by the original PAML package? Well, they are fiddly at times, but they have a number of advantages. Firstly, the config file itself is a useful aide-memoire, and default parameter values can be suggested. Secondly, and perhaps most importantly in a phylogenomic context, the config file acts as a record of the analysis settings. It can be easier to debug pipelines with config files as well. However, a drawback of config files is their size, and in a pipeline context relying on another file existing where and when you need it can be risky. Passing arguments, on the other hand, avoids this and also lets you work more efficiently with pipes.
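To make that last point concrete: in a pipeline, a codeml step typically has to write its own control file before each run. A minimal sketch might look like the following; the file names and model settings are illustrative placeholders for a branch-site style analysis, not a recommendation:

# Generate a minimal codeml control file on the fly; file names and settings are illustrative
cat > codeml.ctl <<'EOF'
   seqfile   = alignment.phy   * codon alignment
   treefile  = tree.nwk        * tree with the foreground branch labelled
   outfile   = results.txt

   seqtype   = 1               * codon sequences
   CodonFreq = 2               * F3x4 codon frequencies
   model     = 2               * branch-site model
   NSsites   = 2
   fix_omega = 0               * estimate omega on the foreground branch
   omega     = 1
EOF

codeml codeml.ctl   # the control file left on disk doubles as a record of the run

The nice part is that the codeml.ctl left behind documents exactly what was run; the risk, as noted above, is that the pipeline now depends on that file being written in the right place at the right time.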

Probably the most significant drawback of the pass-by-argument approach is that, as the number of parameters and options increases, keeping track of your own intentions becomes much harder; running a BEAST analysis purely through arguments, for example, would be virtually impossible. In our pipeline API I coded a wrapper for RAxML where the various options could be laid out clearly in the comments and the operation modes defined as enumerated types, but despite years of using RAxML I still forget certain command-line options from time to time when running single analyses. Overall, though, for relatively simple applications like RAxML, MUSCLE or FastCodeML, I think I prefer command-line arguments as the simpler and cleaner option. Hopefully feature creep won't bloat the option set of future releases.
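I won't reproduce our pipeline wrapper here, but the idea fits in a few lines of shell. In the sketch below the binary name, model and seed values are purely illustrative, though the flags themselves are standard RAxML options:

#!/bin/bash
# Thin wrapper so the intent of each RAxML run is written down once, not remembered at the prompt
RAXML=raxmlHPC                 # adjust to the binary installed locally (e.g. a PTHREADS build)

run_raxml () {
    local mode=$1 alignment=$2 run_name=$3
    case $mode in
        ml_search)
            # Single ML search: -s alignment, -n run name, -m model, -p parsimony-tree seed
            $RAXML -f d -s "$alignment" -n "$run_name" -m GTRGAMMA -p 12345
            ;;
        rapid_bootstrap)
            # Rapid bootstrapping plus best-tree search: -f a, -x bootstrap seed, -N replicates
            $RAXML -f a -s "$alignment" -n "$run_name" -m GTRGAMMA -p 12345 -x 54321 -N 100
            ;;
        *)
            echo "Unknown mode: $mode" >&2
            return 1
            ;;
    esac
}

run_raxml ml_search alignment.phy my_run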

Output

The output from FastCodeML is pleasingly clean and simple:

Doing branch 0

LnL1: -2393.120011398291808 Function calls: 1259

  0.1982869  0.1727072  0.2111106  0.0741396  0.0443093  0.0160066  0.0102474  0.0050577
  0.0084463  0.0254361  0.1328334  0.0754612

p0:  0.4817193  p1:  0.5182807  p2a:  0.0000000  p2b:  0.0000000
w0:  0.0000010  k:   5.1561766  w2:   1.1258227

This contrasts with several screens' worth of ramblings from PAML, written both to stdout and to its various output files (even with verbose disabled). To get the same data from PAML you have to dig about a bit in the stdout or results file, or else look at rst and find:

dN/dS (w) for site classes (K=3)

site class             0         1         2
proportion        0.55182   0.32889   0.11929
branch type 0:    0.05271   1.00000   3.43464
branch type 1:    0.05271   1.00000   0.00000
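(If you would rather not scroll, a one-liner along these lines will pull that block straight out of rst; adjust the -A count as needed, since the exact layout depends on the model settings.)

grep -A 5 'dN/dS (w) for site classes' rst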

The two methods are broadly in agreement: both the proportions and the omegas (dN/dS) estimated are fairly similar. I wasn't too worried about the exact numerical values here, as there are subtle differences between the two implementations.

But what about their speed? This is the major selling-point of FastCodeML, after all; there are many competing methods for detecting natural selection, but the pedigree of Stamatakis and colleagues promises a substantial improvement over PAML. There are two quantities to consider: wall-clock (real) time, the time the calculation appears to take to the user, which matters most for single local analyses; and CPU (user) time, which excludes I/O and time spent sleeping or suspended and is summed across all threads. For cluster work the latter is far more important, since most jobs will be scheduled and CPU time rationed.
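For the record, the comparison only makes sense on the same alignment, tree and equivalent (branch-site) model settings; timings of the sort below come from simply wrapping each run in time, with placeholder file names:

time codeml codeml.ctl                              # control file as sketched earlier
time fast -m 22 -bf -hy 1 tree.nwk alignment.phy    # FastCodeML invocation from above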

Well, according to POSIX::time(), PAML does fairly well:

real	0m50.872s
user	0m50.781s
sys	0m00.006s

But what about FastCodeML?

real	0m05.641s
user	1m22.698s
sys	0m01.158s

Pow. The thing to note here is where the speed-up comes from. Wall-clock (real) time drops from about 51 seconds to under 6, roughly a nine-fold improvement, while FastCodeML's user time is actually higher than PAML's: it runs across multiple threads, and user time is summed over all of them. In other words the gain is in elapsed time rather than in total CPU cost, which is extremely promising for getting individual analyses back quickly; for cluster/pipeline work the benefit will depend on how many cores the scheduler allots each job, but given the cores, analyses should complete significantly faster with FastCodeML…

 

Conclusions

If I'd been bothered enough to use some sort of boxing analogy throughout this post, I'd be reaching for metaphors like 'sucker punch' and 'knockout blow' by now. FastCodeML really is fast, and impressively so. It also offers a reduced filesystem footprint and cleaner I/O, which should make it ideal for cluster/pipeline work. Assuming the accuracy and specificity of its implementation are at least comparable to PAML's (a much harder thing to test), I'm likely to use FastCodeML for most analyses of natural selection in future.