Tag Archives: phylogenetics

BaTS (and Befi-BaTS), SHiAT, and Genome Convergence Pipeline have moved!

Important – please take note!
Headline:

  • All my phylogenetics software is now on GitHub, not websites or Google Code
  • Please use the new FAQ pages and issue/bug tracker forms, rather than emailing me directly in the first instance

Until now, I’ve been hosting the open-sourced parts of my phylogenetics software on code.google.com. These include the BaTS (and Befi-BaTS) tools for phylogeny-trait association correlations; the alignment profilers SHiAT (and Geneious Entropy plugin); and the Genome Convergence API for the Genome Convergence Pipeline and Phylogenomics Dataset Browser. However, Google announced that they are ending support for Google Code, and from August all projects will be read-only.

I’ve therefore migrated all my projects to GitHub. This will eventually include FAQs, forums and issue/bug tracking for the most popular software, BaTS and the Genome Convergence API.

The projects can now be found at:

 

I am also changing how I respond to questions and bug reports. In the past I dealt with questions as they came in, with the odd explanatory post and a manual or readme with each release. Predictably, this meant I spent a lot of time dealing with duplicated or missed bugs and feature requests. I am now in the process of compiling a list of FAQs for each project, as well as uploading the manuals in markdown format so that I can update them with each release. Please bear with me as I go through this process. In the meantime, if you have an issue with a piece of software or think you have found a bug, please:

  1. Make sure you have the most recent version of the software. In most cases this will be available as an executable .jar file on the project GitHub page.
  2. Check the ‘Issues’ tab on the project GitHub page. Your issue may be a duplicate, or already fixed by a new release. If your bug isn’t listed, please open a new issue giving as much detail as possible.
  3. Check the manual and FAQs to see if anyone else has had the same problem – I may well have answered their question already.
  4. If you still need an answer please email me on joe+bioinformaticshelp@kitserve.org.uk

Thanks so much for your support and involvement,

Joe

Interpreting ‘tree space’ in the context of very large empirical datasets

Seminar presented at the Maths Department, University of Portsmouth, 19th November 2014

Evolutionary biologists represent actual or hypothesised evolutionary relationships between living organisms using phylogenies: directed bifurcating graphs (trees) that describe evolutionary processes in terms of speciation or splitting events (nodes) and elapsed evolutionary time or distance (edges). Molecular evolution itself is largely dominated by mutations in DNA sequences, a stochastic process. Traditionally, probabilistic models of molecular evolution and phylogenies are fitted to DNA sequence data by maximum likelihood, on the assumption that a single simple phylogeny will serve to approximate the evolution of a majority of DNA positions in the dataset. However, modern studies now routinely sample several orders of magnitude more DNA positions, and this assumption no longer holds. Unfortunately, our conception of ‘tree space’ – a notional multidimensional surface containing all possible phylogenies – is extremely imprecise, and techniques to model phylogeny fitting in very large datasets are similarly limited. I will present the background to this field and some of the challenges arising from the present limited analytical framework.

Slides [SlideShare]: cc-by-nc-nd


HYPHY Hack: Passing arguments to HYPHY for phylogenetics using the command-line

[ Important update, 2017-Feb-07:

This solution, already a bit hacky, should now be considered a last resort. Sergei and colleague Stephen Weaver have suggested a much more elegant solution; see: https://github.com/veg/hyphy/issues/522. You’ll still have to dive into the batch file you want to iterate over (to work out what user options are presented, in which order), but you should not have to edit the batch files themselves directly. The solution below may no longer work for some versions of HyPhy, owing to altered fscanf() behaviour. ]

HYPHY is a great platform for advanced phylogenetics by Sergei L. Kosakovsky Pond, Simon D. W. Frost and Spencer V. Muse, in which abstract concepts such as likelihood-ratio tests, model selection, and phylogenetic inference are represented and manipulated by means of a powerful and flexible object-oriented language called HyPhy Batch Language, or HBL, using workflows known as ‘batch files’ (actually more like routines). A large number of publications to date (around a thousand) have made use of HYPHY, which includes additional features such as a GUI and ready-to-use implementations of advanced published methods. It also powers the datamonkey.org online phylogenetics server.

However, for all this flexibility, HYPHY actually has an ugly side: because the batch file system is so central to operations, there isn’t a convenient way to pass arguments to HYPHY via the command line. Yes, there are plenty of ways to get data into HYPHY at or before runtime (hard-coded options; reading in config files; dialog prompts on the command line or GUI), but none that correspond to a standard POSIX-style program argument. In a phylogenomics context this caused our group some problems…

The problem

Let’s suppose we have a set of loci (perhaps a few thousand), with different names. An earlier pipeline has produced a set of subdirectories, one for each locus, with an alignment file and a phylogenetic tree in each. Say we want to run the same positive selection test (I’ll assume the branch-site random-effects likelihood test for this post, implemented already in HYPHY as the BranchSiteREL.bf batch file) on each in HYPHY – how can we do that? We have a few options:

  1. Run HYPHY in GUI mode: This has the advantage of being easy to do. But it’s incredibly demanding of human input – who’s going to sit and click through thousands of HYPHY sessions? This input will also make it slower (depending on the analysis, the human component might be the limiting step); and it will certainly introduce the potential for human errors.
  2. Create a custom HYPHY batch file, and rename the input files in each locus: In other words, a script which looks for input files with fixed names like ‘input.fasta’ and ‘input.tre’ in each directory, and runs the analysis on them. Unfortunately, there’s a risk we might over-write files we don’t want to, if one or more HYPHY calls fail part-way through. It could also be hard to parallelise this.
  3. Create a custom HYPHY batch file to loop through the input directories: This is how we probably ought to do things natively in the ‘HYPHY way’ – HBL is powerful enough to let us do things like read directory contents, split and test and generally manipulate strings etc. So we could probably work out how to write a wrapper batch file in HBL for HYPHY that would call BranchSiteREL.bf . But do we really want to delve deeply into yet another language just to do that? And suppose we wanted to run the same analysis on another set of data in a month or so – we’d have to edit the wrapper file to loop through a different directory…
  4. What we really want to do is pass arguments to HYPHY using the command-line: That is, we want to be able to use the STDIN standard input stream to pass the input alignment and phylogeny files’ paths to HYPHY, read them into BranchSiteREL.bf  as variables, and execute the batch file with no further input. This method will be flexible – we can use any paths we want, and change them at any time – and modular because we won’t have lots of different BranchSiteREL.bf files sitting about for analyses at different times, just one.

It turns out that it’s actually pretty easy to do this – it took me an hour or so to work it out, and a couple more for implementation and testing – and with this guide you should be able to do it far quicker. There are several steps:

  1. Refactor the existing batch file to collect key variables
  2. Edit batch file to read variables from STDIN
  3. Call HYPHY in command-line mode, passing variables in-place as a ‘here’ string

That’s it! Here are the steps in detail:

1. Refactor the existing batch file to collect key variables

(NB: links to my hacked copies further down this page)

If you’re not familiar with HYPHY (and if you were, you probably wouldn’t be interested in this hack), this will be the intimidating bit. But relax: if you know C, Java, Perl, or any modernish procedural language, this is easy.

What we want to do is take the existing standard analysis batch file which came with HYPHY, BranchSiteREL.bf, and work out all the places where HYPHY expects user input. We’ll need to either hardcode those, or pass variables from the command-line. To make this less likely to break, we’re going to a) work on a copy of the batch file (mine’s called BranchSiteREL_joeHack.bf), and b) refactor the code so all those variables are initialised right at the start of the batch file, where we can see them.

To start with, run the batch file in GUI mode as normal. This lets you check the input files are actually formatted correctly. Also note down all the points where the script asks for input, and what you want those inputs to be. In the REL test, the steps are: pick genetic code (‘universal’); input alignment (‘hyphy-input.fasta’); input phylogeny (‘hyphy-input.tre’); and output file (‘hyphy-output.REL’ but really, output file prefix – there’s several outputs in fact, which will share this prefix). Now we can go to the head of the copied BranchSiteREL_joeHack.bf file, and set these variables up. To start with, we’ll hardcode them. Later, we’ll read them from the command line via standard input. I’ve used ALL_CAPS variables for readability, not that HBL cares:

/* Variables we'll define and later set by STDIN */
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_GENETIC_CODE = 1;
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";

/* Start of normal batch file */
skipCodeSelectionStep = 0;
LoadFunctionLibrary("chooseGeneticCode_HardcodeUniversal");

LoadFunctionLibrary("GrabBag");
LoadFunctionLibrary("dSdNTreeTools");
LoadFunctionLibrary("CF3x4");
LoadFunctionLibrary("BranchSiteTemplate");
...

So the four variables we’ve introduced are: JOE_HARDCODE_ALIGNMENT; JOE_HARDCODE_PHYLOGENY; JOE_HARDCODE_GENETIC_CODE; and JOE_HARDCODE_OUTPUT. We’ve defined these, but they’re not actually used anywhere yet – as things stand, HYPHY will still try and ask the user for input. What we need to do instead is go through the batch file looking for methods that prompt the user for input, and replace them with our variables instead. From a quick read of the HBL documentation (nb, the HTML documentation that comes with HYPHY is more useful), there seem to be two main ways HYPHY gets user input. They are:

/* fscanf() - reads input to a variable, e.g from console (command-line) to a string, as here: */
fscanf(stdin,"String",SOME_VARIABLE);
/* PROMPT_FOR_FILE, a special variable that opens a system dialog/file chooser, as here: */
DataSet ds = ReadDataFile(PROMPT_FOR_FILE);

All we need to do is look through the batch files for the places where the user interactions we noted in our GUI session happen, and replace the fscanf()s or PROMPT_FOR_FILEs with our variables. Then, when we change the variables from being hardcoded to being passed as arguments at the command prompt, we’ll have our complete program. In the case of BranchSiteREL.bf, there are in fact a number of included scripts (additional batch files or model definition files) used in the analysis – so in some cases we need to change those too. Make sure to use copies and rename them…

The datafile (alignment)
This is found in BranchSiteREL.bf:11, as above. This line is easy to find and change:

DataSet ds = ReadDataFile(PROMPT_FOR_FILE);
/* Change PROMPT_FOR_FILE
to our initialised JOE_HARDCODE_ALIGNMENT

Make sure to _replace_ 'PROMPT_FOR_FILE'
or comment out the old line if you want to copy it! */

DataSet ds = ReadDataFile(JOE_HARDCODE_ALIGNMENT);

The output files’ prefix
This is found in BranchSiteREL.bf:47, as above. Also easy, although PROMPT_FOR_FILE is used in an odd context:

SetDialogPrompt ("Save analysis results to");
fprintf (PROMPT_FOR_FILE, CLEAR_FILE, KEEP_OPEN,"Branch,Mean_dNdS,Omega1,P1,Omega2,P2,Omega3,P3,LRT,p,p_Holm")
/* Replace PROMPT_FOR_FILE with JOE_HARDCODE_OUTPUT */
fprintf (JOE_HARDCODE_OUTPUT, CLEAR_FILE, KEEP_OPEN,"Branch,Mean_dNdS,Omega1,P1,Omega2,P2,Omega3,P3,LRT,p,p_Holm");

The tree (phylogeny)
Annoyingly, this is found in a required batch file, not the main one. It’s found in queryTree.bf, so we need to locate this file, rename it, edit it, and also edit the place where it is called so that our hacked version is called instead. queryTree.bf itself is found in the same directory (TemplateBatchFiles) as BranchSiteREL.bf. I copied it to queryTree_hardcode.bf. Within this the relevant line is queryTree.bf:59, with a similar syntax to the output file:

if (!IS_TREE_PRESENT_IN_DATA)
{
SetDialogPrompt ("Please select a tree file for the data:");

fscanf (PROMPT_FOR_FILE, REWIND, "Raw", treeString);
/* As before, replace PROMPT_FOR_FILE
with our phylogeny variable. In my case,
JOE_HARDCODE_PHYLOGENY*/

fscanf (JOE_HARDCODE_PHYLOGENY, REWIND, "Raw", treeString);

Because this is an external function library, we need to find where in BranchSiteREL.bf it’s imported, and make sure our hacked copy is imported instead. We need BranchSiteREL.bf:44:

LoadFunctionLibrary ("queryTree");
/* Replace with our queryTree_hardcode.bf
(the *.bf suffix isn't needed) */

LoadFunctionLibrary ("queryTree_hardcode");

The genetic code translation definitions
The genetic code translation type is also handled in an external library, chooseGeneticCode.def, but annoyingly, this isn’t in TemplateBatchFiles, but a TemplateBatchFiles/TemplateModels subdirectory. Such is life… again, I’ve worked on a copy, chooseGeneticCode_HardcodeUniversal.def, and after modifying the library itself we need to edit the library call to make sure our hacked version is pulled in. First, the edit, which uses a slightly different, but still intuitive syntax, found at chooseGeneticCode.def:95:

if (!skipCodeSelectionStep)
{
/* this is where the user input routine ChoiceList() is called... */
ChoiceList (modelType,"Choose Genetic Code",1,SKIP_NONE,_geneticCodeOptionMatrix);

if (modelType < 0)
{
return;
}
/* but this is where the variable is actually set... */
ApplyGeneticCodeTable (modelType);
/* ... so we'll replace modelType with our global JOE_HARDCODE_GENETIC_CODE variable */
ApplyGeneticCodeTable (JOE_HARDCODE_GENETIC_CODE);
}

The corresponding call to TemplateModels.chooseGeneticCode.def in BranchSiteREL.bf is right back at line 2:

skipCodeSelectionStep = 0;
LoadFunctionLibrary("chooseGeneticCode");
/* Replace the default library with our hacked one -
Note that the subdirectory path isn't needed; the TemplateModels subdirectory is searched by default */

LoadFunctionLibrary("chooseGeneticCode_HardcodeUniversal");

 

2. Edit batch file to read variables from STDIN

Phew! Good news: that was the fiddly bit; the rest of this is all easy. The next step is to replace the hardcoded variable initialisations at the head of our BranchSiteREL.bf copy with fscanf() methods that will assign values to these variables from the standard input (command line). So we’ll comment out:

/* Variables we'll define and later set by STDIN */
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_GENETIC_CODE = 1;
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";
/* Start of normal batch file */
skipCodeSelectionStep = 0;
...

And replace them with:

/* Variables we'll define and later set by STDIN */
/* comment out the hardcoded definitions ...
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_GENETIC_CODE = 1;
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";

And replace with stdin read via fscanf(): */

fscanf(stdin,"String",JOE_HARDCODE_ALIGNMENT);
fscanf(stdin,"String",JOE_HARDCODE_PHYLOGENY);
fscanf(stdin,"String",JOE_HARDCODE_OUTPUT);
JOE_HARDCODE_GENETIC_CODE = 1; // OK, we'll keep this one hardcoded for now
/* Start of normal batch file */
skipCodeSelectionStep = 0;
...

These are pretty self-explanatory. Done!

3. Call HYPHY in command-line mode, passing variables in-place as a ‘here’ string

At this point, all we’ve really done is refactor the batch file. We’ve moved where the variables are initialised / set, so that we can find them easily, and we’ve called fscanf() on each of them in order to set them. So far, because this implies someone, somehow, will need to type stuff into stdin at a prompt, it doesn’t actually solve our main problem – how to pass variables on the command line to HYPHY – but what it has done is made everything a lot neater. Note that these are still three separate calls to fscanf(), however – which means HYPHY will expect three discrete chunks of user interaction. In a nutshell, if we ran HYPHY now, we’d get something like:

>HYPHY: Please choose a data file:
me: /foo/bar/hyphy_input.fa

>HYPHY: Please select a tree:
me: /foo/bar/hyphy_input.tre

>HYPHY: Please choose a file for output:
me: /foo/bar/hyphy_output.REL

So we need to get bash to accept input from a file or command-line, and pass it onto HYPHY each time HYPHY wants input. The easy way to do this is to put each user response on a separate line in a shell.sh file, and use the ‘<‘ switch to redirect the standard input stream to this file, instead of the keyboard. This might look a bit like:

# in: commands.sh (just the three responses, one per line:
# alignment, tree, output prefix; fscanf() reads each line verbatim)
hyphy-input.fasta
hyphy-input.tre
hyphy-output.REL

# HYPHYMP (the binary) could then be called with:
$user~: HYPHYMP BranchSiteREL_joeHack.bf < commands.sh

But that wouldn’t really help us, would it? We’d have to edit commands.sh separately for each locus! Luckily there is a handy Bash trick which I had to search for a bit – the ‘here’ string (I found this on LinuxJournal). This lets us redirect a string in-place to the command-line, and takes the form:

$user~: command <<<'input_string_to_stdin'

Remembering that we had three fscanf() calls, one for each of our refactored variables, we’ll need three inputs. No problem (StackExchange to the rescue) – we can separate the inputs with newline (‘\n’) characters (we’ll also need the ‘$’ operator, to make sure bash interprets the newlines correctly), like this:

$user~: command <<<$'input_1\ninput_2\ninput_3'

This syntax is equivalent to giving the command command three separate and sequential inputs.
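To see the mechanics outside HYPHY, here's a minimal bash sketch (no HYPHY required): each `read` consumes one line of the here-string, just as each fscanf(stdin, ...) call will consume one of our three inputs.

```shell
#!/bin/bash
# Each 'read' takes one line from stdin, standing in for one of the
# three sequential fscanf(stdin, ...) calls in the hacked batch file.
while read -r line; do
  echo "got: $line"
done <<<$'input_1\ninput_2\ninput_3'
# prints:
# got: input_1
# got: input_2
# got: input_3
```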

Putting it all together

Finally we’ve got everything we need to run HYPHY in command-line mode. To recap:

  • A command-line friendly version of HYPHY (see this post);
  • The edited versions of BranchSiteREL.bf, chooseGeneticCode.def and queryTree.bf, renamed and in place alongside their original copies;
  • Input alignment and tree files, and a writeable output directory;
  • A means (the ‘here’ or ‘<<<‘ operator) of sending multiple-line inputs to the standard input stream.

Running HYPHY on the command line with arguments passed

Let’s do this! There are a couple of additional options (CPU=integer, which sets the number of cores, and BASEPATH=/path/to/batchfiles, which ensures the right batchfile directory is being used) but don’t worry about those for now.

The complete command is:

/usr/local/bin/HYPHYMP CPU=number_of_cpu_cores BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ BranchSiteREL_joeHack.bf <<<$'/path/to/hyphy_input.fa\n/path/to/hyphy_input.tre\n/path/to/hyphy_output.REL'

You can optionally use stuff like >log.out and 2>log.err to redirect STDOUT and STDERR if you want; also & to fork and leave running etc. But the critical bit of this command is the last bit, after the ‘<<<‘ handle. I’ve only tested this using absolute/full pathnames for the input/output file arguments – it’s a pain but less likely to break in the short-term (what happens if you move the whole project folder is another matter…)

I admit this looks absolutely horrible. But it’s the best I can do.

In practice

So for me (user=jparker) working from /Downloads/hyphy_hacks/hackinput with alignments hyphy-input.fa and hyphy-input.tre, and outputting to files with prefix run2, the complete command is:

/usr/local/bin/HYPHYMP CPU=2 BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ BranchSiteREL_joeHack.bf <<<$'/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.fa\n/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.tre\n/home/jparker/Downloads/hyphy_hacks/hackinput/run2'

And if I don’t want to wait for it to complete, and send stdout and stderr to some files, the command is:

/usr/local/bin/HYPHYMP CPU=2 BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ BranchSiteREL_joeHack.bf <<<$'/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.fa\n/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.tre\n/home/jparker/Downloads/hyphy_hacks/hackinput/run4' >run4.stdout 2>run4.err &

Lastly you can change the argument to the CPU= command if you want to. Be aware that by default HYPHYMP uses as many cores as it can see (I think)…
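With the command above working for one locus, iterating over a whole set of per-locus directories is a short bash loop. A sketch, assuming a hypothetical layout of loci/&lt;locus&gt;/hyphy-input.fa and loci/&lt;locus&gt;/hyphy-input.tre; the HYPHY command is echoed rather than executed (a dry run), so remove the echo once the printed commands look right:

```shell
#!/bin/bash
# Hypothetical demo layout: one directory per locus (created here for illustration)
mkdir -p loci/ENSG001 loci/ENSG002

# Dry run: print one complete HYPHY invocation per locus.
# Absolute paths (via $PWD) are used, as in the post; remove 'echo' to execute.
for dir in loci/*/; do
  locus=${dir%/}
  echo "/usr/local/bin/HYPHYMP CPU=2 BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ BranchSiteREL_joeHack.bf <<<\$'$PWD/$locus/hyphy-input.fa\n$PWD/$locus/hyphy-input.tre\n$PWD/$locus/output.REL'"
done
```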

Reviewing FastCodeML, a potential replacement for PAML’s codeml

(first posted at SBCS-Evolve)

Valle and colleagues have just released FastCodeML, an application for the detection of positive selection in DNA datasets. This is a successor to SlimCodeML, which is itself an improved implementation of the codeml program from Ziheng Yang’s workhorse phylogenetic analysis package PAML, which continues to be extremely popular, having moved through several versions and at long last acquired a GUI. Note ‘analysis’ – these programs are not intended for phylogeny inference (finding phylogenetic trees from data), although they will do a rough search if required. In fact, for many people the main program of choice for phylogeny inference itself is RAxML, which is also produced by the same lab (Stamatakis). So – given this lab’s pedigree, I was interested to see how FastCodeML compares to codeml, the implementation the authors seek to replace.

Installation

To install FastCodeML you’ll find a set of tar’ed distributions at their FTP site, ftp://ftp.vital-it.ch/tools/FastCodeML/. It wasn’t immediately clear which is the most recent build, but I downloaded both a June 2013 version and the 1.01 release just in case. These include a precompiled binary for Linux (we’re on CentOS for our dev cluster), and as I was keen to test the performance of the un-optimised binary that ships with the package, I used this. However a simple configure / make / make check / make install will do the job as well. I had to fiddle with a couple of permissions, but that’s all stuff unique to our dev cluster and common to everything we install, so it’s hardly fair to penalise FastCodeML for that…

Interface

FastCodeML, like RAxML, uses arguments passed on the command line at runtime rather than a GUI (like, say, HYPHY or Garli) or a configuration script (like PAML or BEAST). As a result (assuming you’ve added the FastCodeML executable, fast, to your $PATH) you can paste and go, e.g.:

fast -m 22 -bf -hy 1 ./<tree-file> <sequence-file>

I’m ambivalent about passing arguments on the command line for phylogenetic analyses. Let’s look at the pros and cons. A GUI is more attractive to many users, but only with the important caveat that it is well designed; otherwise usability is degraded for little gain in performance or flexibility. A lot of phylogeneticists are used to working without GUIs and (more importantly) are crunching large genomic datasets these days, so the lack of a GUI isn’t really a problem.

What about config files, the approach taken by the original PAML package? Well, they are fiddly at times, but there are a number of advantages. Firstly, the config file itself is a useful aide-memoire, and default parameter values can be suggested. Secondly, and perhaps most importantly in a phylogenomic context, the config file acts as a record of the analysis settings. It can be easier to debug pipelines with config files as well. However, a drawback of config files is their size, and in a pipeline context relying on another file existing where and when you need it can be risky. Passing arguments on the other hand avoids this, and also allows you to work more efficiently with pipes.

About the most significant drawback of the pass-by-argument approach is that as the number of parameters/options increases, keeping track of your own intentions becomes much harder. For example, running a BEAST analysis using arguments would be virtually impossible. In our pipeline API I coded a wrapper for RAxML where the various options could be laid out clearly in the comments, and various operation modes defined as enumerated types; but despite years of using RAxML I still forget certain command-line options from time-to-time if running single analyses. Overall, though, for simple applications like RAxML, MUSCLE or FastCodeML, I think I prefer command-line arguments as simpler and cleaner. Hopefully feature creep won’t bloat the option set of future releases.
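As an aside, the wrapper idea is easy to sketch in bash: named defaults sit in one visible place (the self-documenting benefit of a config file) while each call stays a one-liner. The RAxML flags here (-T threads, -p parsimony seed, -n run name) are real, but the wrapper itself is a hypothetical sketch, shown as a dry run:

```shell
#!/bin/bash
# Hypothetical RAxML wrapper: defaults and option meanings live in one place.
run_raxml() {
  local threads=${1:?usage: run_raxml THREADS [SEED] NAME}  # -T: thread count
  local seed=${2:-12345}                                    # -p: parsimony random seed
  local name=${3:?need a run name}                          # -n: output run name
  # Dry run: print the command; drop 'echo' to actually invoke RAxML
  echo "raxmlHPC-PTHREADS -T $threads -p $seed -n $name"
}

run_raxml 4 99 test1   # prints: raxmlHPC-PTHREADS -T 4 -p 99 -n test1
```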

Output

The output from FastCodeML is pleasingly clean and simple:

Doing branch 0

LnL1: -2393.120011398291808 Function calls: 1259

  0.1982869  0.1727072  0.2111106  0.0741396  0.0443093  0.0160066  0.0102474  0.0050577
  0.0084463  0.0254361  0.1328334  0.0754612

p0:  0.4817193  p1:  0.5182807  p2a:  0.0000000  p2b:  0.0000000
w0:  0.0000010  k:   5.1561766  w2:   1.1258227

This contrasts with several screens’ worth of ramblings from PAML, on both stdout and output streams (even with verbose disabled). To get the same data from PAML you have to dig about a bit in the stdout or results file, or else look at rst and find:

dN/dS (w) for site classes (K=3)

site class             0         1         2
proportion        0.55182   0.32889   0.11929
branch type 0:    0.05271   1.00000   3.43464
branch type 1:    0.05271   1.00000   0.00000

The two methods are broadly in agreement, to a first approximation: both the proportions and omegas (dN/dS) estimated are fairly similar. I wasn’t too worried about the numerical value of the estimates here, as there are subtle differences in the implementation of both.

But what about their speed? This is the major selling-point of FastCodeML after all; there are many competing methods to detect natural selection but the pedigree of Stamatakis and colleagues promises a substantial improvement over PAML. There are two considerations here: wall-clock time, the time the calculation appears to take to the user (most relevant to single local analyses); and CPU time, which excludes I/O and time the thread spends sleeping or suspended – for cluster work this is far more important, since a majority of jobs will be scheduled and CPU time rationed.

Well, according to POSIX::time(), PAML does fairly well:

real	0m50.872s
user	0m50.781s
sys	0m00.006s

But what about FastCodeML?

real	0m05.641s
user	1m22.698s
sys	0m01.158s

Pow. The important thing to note here is that the wall-clock (‘real’) time is around nine times faster than PAML’s. The total CPU (‘user’) time is actually higher for FastCodeML, because it spreads the work across several cores in parallel; so the dramatic speed-up a single user sees comes partly from that parallelisation, and scheduled cluster jobs that ration CPU time will see a smaller gain. Even so, this is extremely promising and suggests that cluster/pipeline operations should be significantly faster using FastCodeML…

 

Conclusions

If I’d been bothered enough to use some sort of boxing analogy throughout this post, I’d be using metaphors like ‘sucker punch’ and ‘knockout blow’ by now. FastCodeML really is fast, and impressively so. It also offers a reduced filesystem footprint and cleaner I/O, which will likely make it ideal for cluster/pipeline work. Assuming the accuracy and specificity of its implementation are at least comparable to PAML (but that’s a much harder thing to test) I’m likely to use FastCodeML for most analyses of natural selection in future.

Using x-windows to render GUIs for server-client phylogenomics

The power of modern computers allied to high-throughput next-generation sequencing will usher in a new era of quantitative biology that will deliver the moon on a stick to every researcher on Earth…

Well, quite.

But something we’ve run up against recently in the lab is actually performing useful analyses on our phylogenomic datasets after the main evolutionary analyses. That is, once you’ve assembled, aligned, built phylogenies and measured interesting things like divergence rates and dN/dS – how do you display and explore the collected set of loci?

I’ll give you an example – suppose we have a question like “what is the correlation between tree length and mean dS rate in our 1,000 loci?” It’s fairly simple to spit the relevant statistics out into text files and then cat them together for analysis in R (or Matlab, but please, for the love of God, not Excel…). And so far, this approach has got a lot of work done for us.
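The collation step itself is a one-liner in shell. A sketch, assuming a hypothetical layout where each locus directory writes a single tab-separated line (tree length, mean dS) to stats.txt; the demo data here is invented purely for illustration:

```shell
#!/bin/bash
# Invented demo data: one stats line per locus directory
mkdir -p locus_A locus_B
printf '1.2\t0.30\n' > locus_A/stats.txt
printf '2.5\t0.71\n' > locus_B/stats.txt

# Collate: a header row plus one line per locus, ready for R
printf 'tree_length\tmean_dS\n' > all_stats.tsv
cat locus_*/stats.txt >> all_stats.tsv

# In R: d <- read.delim("all_stats.tsv"); cor(d$tree_length, d$mean_dS)
```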


But suppose we wanted to change the question slightly – maybe we notice some pattern when eyeballing a few of the most interesting alignments – and instead ask something like “what is the correlation between tree length and mean dS rate in our 1,000 loci, where polar residues are involved with a sitewise log-likelihood of ≤ 2 units?” Or ≤ 3? Or ≤ 10? Most of this information is already available at runtime, parsed or calculated. We could tinker with our collation script again, re-run it, and do another round of R analyses; but this is wasteful on a variety of levels. It would be much more useful to interact with the data on the server while it’s in memory. We’ve been starting to look at ways of doing this, and that’s for another post. But for now there’s another question to solve before that becomes useful, namely – how do you execute a GUI on a server but display it on a client?

There are a variety of bioinformatics applications (Geneious; Galaxy; CLCBio) with client-server functionality built in, but in each case we’d have to wrap our analysis in a plugin layer or module in order to use it. Since we’ve already got the analysis in this case, that seems unnecessary, so instead we’re trialling X-windows. This is nothing to do with Microsoft Windows: it’s just a display protocol which piggybacks on an SSH connection, so that a GUI launched on the UNIX server is rendered on the client’s display. It sounds simple, and it is!

To set up an x-windowing session, we just have to first set up the connection to the bioinformatics server:

ssh -Y user@host.ac.uk

Now we have a secure x-window connection, our server-based GUI will display on the client if we simply call it (MDK.jar):

User:~ /usr/bin/java -jar ~/Documents/dev-builds/mdk/MDK.jar

Rather pleasingly simple, yes? Now there’s a few issues with this, not least a noticeable cursor lag. But – remembering that the aim here is simply to interact with analysis data held in memory at runtime, rather than do any complicated editing – we can probably live with that, for now at least. On to the next task!

(also published at evolve.sbcs)

Our Nature paper! Genome-wide molecular convergence in echolocating mammals

Exciting news from the lab this week… we’ve published in one of the leading journals, Nature!!!

Much of my work in the Rossiter BatLab for the last couple of years has centred around the search for genomic signatures of molecular convergence. This means looking for similar genetic changes in otherwise unrelated organisms. We’d normally expect unrelated organisms to differ considerably in their genetic sequences, because over time random mutations occur in their genomes; the more time has passed since two species diverged, the more changes we expect. However, we know that similar structures may evolve in unrelated species due to shared selection pressures (think of the streamlined body shapes of sharks, ichthyosaurs and dolphins, for example). Can these pressures produce identical changes right down at the level of genetic sequences? We hoped to detect identical genetic changes in unrelated species (in this case, echolocation – ‘sonar hearing’ – in some species of bats and whales) caused by similar selection pressures operating on the evolution of the genes required for those traits.

It’s been a long slog – we’ve had to write a complicated computer program to look at millions of letters of DNA – but this week it all bears fruit. We found that a <em>staggering</em> number of genes in the genomes of echolocating bats and whales (a bottlenose dolphin, if you must) showed evidence of these similar genetic changes, known technically as ‘genetic convergence’.

Obviously we started jumping up and down when we found this, and because we imagined other scientists would too, we wrote up our findings and sent them to the journal <em>Nature</em>, one of the top journals in the world of science… and crossed our fingers.

Well, today we can finally reveal that we were able to get through the peer-review process (where anonymous experts scrutinise your working – a bit like an MOT for your experiments), and the paper is out today!

But what do we actually say? Well:
<blockquote>Evolution is typically thought to proceed through divergence of genes, proteins and ultimately phenotypes. However, similar traits might also evolve convergently in unrelated taxa owing to similar selection pressures. Adaptive phenotypic convergence is widespread in nature, and recent results from several genes have suggested that this phenomenon is powerful enough to also drive recurrent evolution at the sequence level. Where homoplasious substitutions do occur these have long been considered the result of neutral processes. However, recent studies have demonstrated that adaptive convergent sequence evolution can be detected in vertebrates using statistical methods that model parallel evolution, although the extent to which sequence convergence between genera occurs across genomes is unknown. Here we analyse genomic sequence data in mammals that have independently evolved echolocation and show that convergence is not a rare process restricted to several loci but is instead widespread, continuously distributed and commonly driven by natural selection acting on a small number of sites per locus. Systematic analyses of convergent sequence evolution in 805,053 amino acids within 2,326 orthologous coding gene sequences compared across 22 mammals (including four newly sequenced bat genomes) revealed signatures consistent with convergence in nearly 200 loci. Strong and significant support for convergence among bats and the bottlenose dolphin was seen in numerous genes linked to hearing or deafness, consistent with an involvement in echolocation. Unexpectedly, we also found convergence in many genes linked to vision: the convergent signal of many sensory genes was robustly correlated with the strength of natural selection. This first attempt to detect genome-wide convergent sequence evolution across divergent taxa reveals the phenomenon to be much more pervasive than previously recognized.</blockquote>
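As a toy illustration of the kind of signal involved – not the likelihood-based method used in the paper, and with entirely made-up example data – a naive screen for candidate convergent substitutions simply flags alignment sites where the two focal taxa share an amino acid that no background taxon carries:

```python
# Hypothetical toy alignment: taxon name -> aligned amino-acid sequence.
# These sequences are invented for illustration only.
aln = {
    "bat":     "MKLVF",
    "dolphin": "MKLVF",
    "cow":     "MALVY",
    "horse":   "MALVY",
    "human":   "MALVF",
}
focal = ("bat", "dolphin")  # the independently echolocating lineages

def candidate_convergent_sites(aln, focal):
    """Return alignment columns where both focal taxa share a state
    absent from every background taxon. This is a crude screen, not
    the site-specific likelihood test used in the actual analysis."""
    background = [t for t in aln if t not in focal]
    length = len(next(iter(aln.values())))
    sites = []
    for i in range(length):
        a, b = (aln[t][i] for t in focal)
        if a == b and all(aln[t][i] != a for t in background):
            sites.append(i)
    return sites

print(candidate_convergent_sites(aln, focal))  # site 1: K in bat+dolphin only
```

A real analysis additionally has to distinguish convergence from chance and from shared ancestry, which is why the paper compares site-specific support for convergent versus divergent tree topologies rather than simply counting shared states.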
Congrats to Steve, Georgia and Joe! After a few deserved beers we’ll have our work cut out to pick through all these genes and work out exactly what all of them do (guessing the genes’ biological functions, especially in non-model (read: not us or things we eat) organisms like bats and dolphins, is notoriously tricky). So we’ll probably stick our heads out of the lab again in <em>another</em> two years…

The full citation is: Parker, J., Tsagkogeorga, G., Cotton, J.A.C., Liu, R., Stupka, E., Provero, P. &amp; Rossiter, S.J. (2013) Genome-wide signatures of convergent evolution in echolocating mammals. <em>Nature</em> (epub ahead of print), 4th September 2013. doi:10.1038/nature12511. This work was funded by Biotechnology and Biological Sciences Research Council (UK) grant BB/H017178/1.


Molecular epidemiology and phylogeny reveals complex spatial dynamics of endemic canine parvovirus.

J Virol. 2011 May 18. [Epub ahead of print]

Clegg SR, Coyne KP, Parker J, Dawson S, Godsall SA, Pinchbeck G, Cripps PJ, Gaskell RM, Radford AD.

Canine parvovirus 2 (CPV-2) is a severe enteric pathogen of dogs, causing high mortality in unvaccinated animals. After emerging, CPV-2 spread rapidly worldwide; however, there is now some evidence that international transmission is more restricted. In order to investigate the transmission and evolution of CPV-2 both nationally and in relation to the global situation, we used a long-range PCR to amplify and sequence the full VP2 gene of 150 canine parvoviruses obtained from a large cross-sectional sample of dogs presenting with severe diarrhoea to veterinarians in the UK over a two-year period. Amongst these 150 strains, 50 different DNA sequence types were identified, and apart from one case, all appeared unique to the UK. Phylogenetic analysis provided clear evidence for spatial clustering at the international level, and for the first time also at the national level, with the geographical range of some sequence types appearing to be highly restricted within the UK. Evolution of the VP2 gene in this dataset was associated with a lack of positive selection. In addition, the majority of predicted amino acid sequences were identical to those found elsewhere in the world, suggesting CPV VP2 has evolved a highly fit conformation. Based on typing systems using key amino acid mutations, 43% of viruses were CPV-2a and 57% CPV-2b, with no type 2 or 2c found. However, phylogenetic analysis suggested complex antigenic evolution of this virus, with both type 2a and 2b viruses appearing polyphyletic. As such, typing based on specific amino acid mutations may not reflect the true epidemiology of this virus. The geographical restriction we observed both within the UK and between the UK and other countries, together with the lack of CPV-2c in this population, strongly suggests that the spread of CPV within its population may be heterogeneously subject to limiting factors. This cross-sectional study of national and global CPV phylogeographic segregation reveals a substantially more complex epidemic structure than previously described.

Generation of neutralizing antibodies and divergence of SIVmac239 in cynomolgus macaques following short-term early antiretroviral therapy.

PLoS Pathog. 2010 Sep 2;6(9):e1001084.
Ozkaya Sahin G, Bowles EJ, Parker J, Uchtenhagen H, Sheik-Khalil E, Taylor S, Pybus OG, Mäkitalo B, Walther-Jallow L, Spångberg M, Thorstensson R, Achour A, Fenyö EM, Stewart-Jones GB, Spetz AL.

Neutralizing antibodies (NAb) able to react to heterologous viruses are generated during natural HIV-1 infection in some individuals. Further knowledge is required in order to understand the factors contributing to induction of cross-reactive NAb responses. Here a well-established model of experimental pathogenic infection in cynomolgus macaques, which reproduces long-lasting HIV-1 infection, was used to study the NAb response as well as the viral evolution of the highly neutralization-resistant SIVmac239. Twelve animals were infected intravenously with SIVmac239. Antiretroviral therapy (ART) was initiated ten days post-inoculation and administered daily for four months. Viral load, CD4(+) T-cell counts, total IgG levels, and breadth as well as strength of NAb in plasma were compared simultaneously over 14 months. In addition, envs from plasma samples were sequenced at three time points in all animals in order to assess viral evolution. We report here that seven of the 12 animals controlled viremia to below 10(4) copies/ml of plasma after discontinuation of ART and that this control was associated with a low level of evolutionary divergence. Macaques that controlled viral load developed broader NAb responses early on. Furthermore, escape mutations, such as V67M and R751G, were identified in virus sequenced from all animals with uncontrolled viremia. Bayesian estimation of ancestral population genetic diversity (PGD) showed an increase in this value in non-controlling or transient-controlling animals during the first 5.5 months of infection, in contrast to virus-controlling animals. Similarly, non- or transient controllers displayed more positively-selected amino-acid substitutions. An early increase in PGD, resulting in the generation of positively-selected amino-acid substitutions, greater divergence and relatively high viral load after ART withdrawal, may have contributed to the generation of potent NAb in several animals after SIVmac239 infection. However, early broad NAb responses correlated with relatively preserved CD4(+) T-cell numbers, low viral load and limited viral divergence.

Full-Length Characterization of Hepatitis C Virus Subtype 3a Reveals Novel Hypervariable Regions under Positive Selection during Acute Infection

Humphreys I, Fleming V, Fabris P, Parker J, Schulenberg B, Brown A, Demetriou C, Gaudieri S, Pfafferott K, Lucas M, Collier J, Huang KH, Pybus OG, Klenerman P, Barnes E.

J Virol. 2009 Nov;83(22):11456-66. Epub 2009 Sep 9.

Hepatitis C virus subtype 3a is a highly prevalent and globally distributed strain that is often associated with infection via injection drug use. This subtype exhibits particular phenotypic characteristics. In spite of this, detailed genetic analysis of this subtype has rarely been performed. We performed full-length viral sequence analysis in 18 patients with chronic HCV subtype 3a infection and assessed genomic viral variability in comparison to other HCV subtypes. Two novel regions of intragenotypic hypervariability within the envelope protein E2 of HCV genotype 3a were identified. We named these regions HVR495 and HVR575. They consisted of flanking conserved hydrophobic amino acids and central variable residues. A 5-amino-acid insertion found only in genotype 3a and a putative glycosylation site are contained within HVR575. Evolutionary analysis of E2 showed that positively selected sites within genotype 3a infection were largely restricted to HVR1, HVR495, and HVR575. Further analysis of clonal viral populations within single hosts showed that viral variation within HVR495 and HVR575 was subject to intrahost positive selection. Longitudinal analysis of four patients with acute HCV subtype 3a infection sampled at multiple time points showed that positively selected mutations within HVR495 and HVR575 arose early during primary infection. HVR495 and HVR575 were not present in HCV subtypes 1a, 1b, 2a, or 6a. Some variability that was not subject to positive selection was present in subtype 4a HVR575. Further defining the functional significance of these regions may have important implications for genotype 3a E2 virus-receptor interactions and for vaccine studies that aim to induce cross-reactive anti-E2 antibodies.

The within- and among-host evolution of chronically-infecting human RNA viruses

A research thesis submitted for the degree of Doctor of Philosophy at the University of Oxford.

J Parker

Funded by: Natural Environment Research Council (UK) with support from Linacre College, Oxford.

Abstract: This thesis examines the evolutionary biology of the RNA viruses, a diverse group of pathogens that cause significant diseases. The focus of this work is the relationship between the processes driving the evolution of virus populations within individual hosts and at the epidemic level.

First, Chapter One reviews the basic biology of RNA viruses, the current state of knowledge in relevant topics of evolutionary virology, and the principles that underlie the most commonly used methods in this thesis.

In Chapter Two, I develop and test a novel framework to estimate the significance of phylogeny-trait association in viral phylogenies. The method incorporates phylogenetic uncertainty through the use of posterior sets of trees (PST) produced in Bayesian MCMC analyses.
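The logic of that test can be sketched in a few lines – a minimal toy with a hypothetical binary trait and a single hand-built tree, not the BaTS implementation itself: compute an association statistic (here the Fitch parsimony score of the trait on each tree), average it over the set of trees, and compare the observed value with a null distribution generated by permuting trait labels across the tips:

```python
import random
from statistics import mean

# Hypothetical toy data: a tree as nested 2-tuples (leaves are tip names)
# and a binary trait perfectly clustered on the two main clades.
tree = ((("A1", "A2"), "A3"), (("B1", "B2"), "B3"))
traits = {"A1": "X", "A2": "X", "A3": "X", "B1": "Y", "B2": "Y", "B3": "Y"}

def fitch_score(node, traits):
    """Return (state_set, parsimony_score) for a subtree (Fitch algorithm)."""
    if isinstance(node, str):                      # leaf: its state, zero changes
        return {traits[node]}, 0
    left_set, left_score = fitch_score(node[0], traits)
    right_set, right_score = fitch_score(node[1], traits)
    shared = left_set & right_set
    if shared:                                     # children agree: no extra change
        return shared, left_score + right_score
    return left_set | right_set, left_score + right_score + 1

def association_p_value(trees, traits, n_perm=1000, seed=1):
    """Mean parsimony score over a set of trees, with a permutation null:
    low observed scores relative to the null indicate trait clustering."""
    rng = random.Random(seed)
    observed = mean(fitch_score(t, traits)[1] for t in trees)
    tips = list(traits)
    null_leq = 0
    for _ in range(n_perm):
        shuffled = dict(zip(tips, rng.sample(list(traits.values()), len(tips))))
        if mean(fitch_score(t, shuffled)[1] for t in trees) <= observed:
            null_leq += 1
    return observed, null_leq / n_perm

obs, p = association_p_value([tree], traits)
print(obs, p)
```

In the real method the list of trees is the posterior set of trees from a Bayesian MCMC run, so phylogenetic uncertainty is carried through into the null distribution rather than being conditioned away on a single point-estimate tree.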

In Chapter Three, I conduct a comprehensive analysis of the substitution rate of hepatitis C virus (HCV) in within- and between-host data sets using a relaxed molecular clock. I find that within-host substitution rates are more rapid than previously appreciated, that heterotachy is rife in within-host data sets, and that selection is likely to be a primary driver.

In Chapter Four I apply the techniques developed in Chapter Two to successfully detect compartmentalization between peripheral blood and cervical tissues in a large data set of human immunodeficiency virus (HIV) patients. I propose that compartmentalization in the cervix is maintained by selection.

In Chapter Five I extend the framework developed in Chapter Two and explore the Type II error of the statistics used.

In Chapter Six I review the findings of this thesis and conclude with a general discussion of the relationship between within- and among-host evolution in viruses, and some of the limitations of current techniques.