Tag Archives: bioinformatics

HYPHY Hack: Passing arguments to HYPHY for phylogenetics using the command-line

[ Important update, 2017-Feb-07:

This solution, already a bit hacky, should now be considered a last resort. Sergei and colleague Stephen Weaver have suggested a much more elegant solution; see https://github.com/veg/hyphy/issues/522. You’ll still have to dive into the batch file you want to iterate over (to work out which user options are presented, and in which order), but you should not have to edit the batch files themselves directly. Note that the solution below may no longer work for some versions of HyPhy, owing to altered fscanf() behaviour. ]

HYPHY is a great platform for advanced phylogenetics by Sergei L. Kosakovsky Pond, Simon D. W. Frost and Spencer V. Muse, in which abstract concepts such as likelihood-ratio tests, model selection, and phylogenetic inference are represented and manipulated by means of a powerful and flexible object-oriented language called HyPhy Batch Language, or HBL, using workflows known as ‘batch files’ (actually more like routines). A large number of publications (around a thousand to date) have made use of HYPHY, which also includes additional features such as a GUI and ready-to-use implementations of advanced published methods. It also powers the datamonkey.org online phylogenetics server.

However, for all this flexibility, HYPHY has an ugly side: because the batch-file system is so central to its operation, there isn’t a convenient way to pass arguments to HYPHY via the command line. Yes, there are plenty of ways to get data into HYPHY at or before runtime (hard-coded options; reading in config files; dialog prompts on the command line or GUI), but none that corresponds to a standard POSIX-style program argument. In a phylogenomics context this caused our group some problems…

The problem

Let’s suppose we have a set of loci (perhaps a few thousand), with different names. An earlier pipeline has produced a set of subdirectories, one for each locus, with an alignment file and a phylogenetic tree in each. Say we want to run the same positive selection test (I’ll assume the branch-site random-effects likelihood test for this post, implemented already in HYPHY as the BranchSiteREL.bf batch file) on each in HYPHY – how can we do that? We have a few options:

  1. Run HYPHY in GUI mode: This has the advantage of being easy to do. But it’s incredibly demanding of human input – who’s going to sit and click through thousands of HYPHY sessions? All that input will also make the analysis slower (depending on the analysis, the human component might be the limiting step), and it will certainly introduce the potential for human error.
  2. Create a custom HYPHY batch file, and rename the input files in each locus: In other words, hardcode the batch file to look for input files named something like ‘input.fasta’ and ‘input.tre’, and copy each locus’s files to those names in turn before running the analysis. Unfortunately, there’s a risk we might overwrite files we don’t want to if one or more HYPHY calls fails part-way through. It could also be hard to parallelise.
  3. Create a custom HYPHY batch file to loop through the input directories: This is probably how we ought to do things natively, the ‘HYPHY way’ – HBL is powerful enough to let us read directory contents, split and test and generally manipulate strings, etc. So we could probably work out how to write a wrapper batch file in HBL that would call BranchSiteREL.bf. But do we really want to delve deeply into yet another language just to do that? And suppose we wanted to run the same analysis on another set of data in a month or so – we’d have to edit the wrapper file to loop through a different directory…
  4. What we really want to do is pass arguments to HYPHY on the command line: That is, we want to use the standard input stream (STDIN) to pass the input alignment and phylogeny files’ paths to HYPHY, read them into BranchSiteREL.bf as variables, and execute the batch file with no further input. This method will be flexible – we can use any paths we want, and change them at any time – and modular, because we won’t have lots of different BranchSiteREL.bf files sitting about for analyses at different times, just one.

It turns out that it’s actually pretty easy to do this – it took me an hour or so to work it out, and a couple more for implementation and testing – and with this guide you should be able to do it far quicker. There are several steps:

  1. Refactor the existing batch file to collect key variables
  2. Edit batch file to read variables from STDIN
  3. Call HYPHY in command-line mode, passing variables in-place as a ‘here’ string

That’s it! Here are the steps in detail:

1. Refactor the existing batch file to collect key variables

(NB: links to my hacked copies further down this page)

If you’re not familiar with HYPHY (and if you were, you probably wouldn’t be interested in this hack), this will be the intimidating bit. But relax: if you know C, Java, Perl, or any modernish procedural language, this is easy.

What we want to do is take the existing standard analysis batch file which came with HYPHY, BranchSiteREL.bf, and work out all the places where HYPHY expects user input. We’ll need to either hardcode those, or pass variables from the command-line. To make this less likely to break, we’re going to a) work on a copy of the batch file (mine’s called BranchSiteREL_joeHack.bf), and b) refactor the code so all those variables are initialised right at the start of the batch file, where we can see them.

To start with, run the batch file in GUI mode as normal. This lets you check the input files are actually formatted correctly. Also note down all the points where the script asks for input, and what you want those inputs to be. In the REL test, the steps are: pick genetic code (‘universal’); input alignment (‘hyphy-input.fasta’); input phylogeny (‘hyphy-input.tre’); and output file (‘hyphy-output.REL’ – though really this is an output file prefix: there are several output files, which will share it). Now we can go to the head of the copied BranchSiteREL_joeHack.bf file and set these variables up. To start with, we’ll hardcode them; later, we’ll read them from the command line via standard input. I’ve used ALL_CAPS variable names for readability, not that HBL cares:

/* Variables we'll define and later set by STDIN */
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_GENETIC_CODE = 1;
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";

/* Start of normal batch file */
skipCodeSelectionStep = 0;
LoadFunctionLibrary("chooseGeneticCode_HardcodeUniversal");

LoadFunctionLibrary("GrabBag");
LoadFunctionLibrary("dSdNTreeTools");
LoadFunctionLibrary("CF3x4");
LoadFunctionLibrary("BranchSiteTemplate");
...

So the four variables we’ve introduced are: JOE_HARDCODE_ALIGNMENT; JOE_HARDCODE_PHYLOGENY; JOE_HARDCODE_GENETIC_CODE; and JOE_HARDCODE_OUTPUT. We’ve defined these, but they’re not actually used anywhere yet – as things stand, HYPHY will still try to ask the user for input. What we need to do instead is go through the batch file looking for methods that prompt the user for input, and replace them with our variables. From a quick read of the HBL documentation (NB: the HTML documentation that comes with HYPHY is more useful), there seem to be two main ways HYPHY gets user input. They are:

/* fscanf() - reads input to a variable, e.g from console (command-line) to a string, as here: */
fscanf(stdin,"String",SOME_VARIABLE);
/* PROMPT_FOR_FILE, a special variable that opens a system dialog/file chooser, as here: */
DataSet ds = ReadDataFile(PROMPT_FOR_FILE);

All we need to do is look through the batch files, find the places where the user interactions we noted in our GUI session happen, and replace the fscanf()‘s or PROMPT_FOR_FILE‘s with our variables. Then, when we change those variables from being hardcoded to being passed as arguments at the command prompt, we’ll have our complete program. In the case of BranchSiteREL.bf there are in fact a number of included scripts (additional batch files or model definition files) used in the analysis – so in some cases we need to change those too. Make sure to use copies and rename them…

The datafile (alignment)
This is found in BranchSiteREL.bf:11, as above. This line is easy to find and change:

DataSet ds = ReadDataFile(PROMPT_FOR_FILE);
/* Change PROMPT_FOR_FILE
to our initialised JOE_HARDCODE_ALIGNMENT

Make sure to _replace_ 'PROMPT_FOR_FILE'
or comment out the old line if you want to copy it! */

DataSet ds = ReadDataFile(JOE_HARDCODE_ALIGNMENT);

The output files’ prefix
This is found in BranchSiteREL.bf:47, as above. Also easy, although PROMPT_FOR_FILE is used in an odd context:

SetDialogPrompt ("Save analysis results to");
fprintf (PROMPT_FOR_FILE, CLEAR_FILE, KEEP_OPEN,"Branch,Mean_dNdS,Omega1,P1,Omega2,P2,Omega3,P3,LRT,p,p_Holm");
/* Replace PROMPT_FOR_FILE with JOE_HARDCODE_OUTPUT */
fprintf (JOE_HARDCODE_OUTPUT, CLEAR_FILE, KEEP_OPEN,"Branch,Mean_dNdS,Omega1,P1,Omega2,P2,Omega3,P3,LRT,p,p_Holm");

The tree (phylogeny)
Annoyingly, this one lives in a required batch file, queryTree.bf, not the main one – so we need to locate that file, rename it, edit it, and also edit the place where it’s called, so that our hacked version is used instead. queryTree.bf itself is found in the same directory (TemplateBatchFiles) as BranchSiteREL.bf; I copied it to queryTree_hardcode.bf. Within this, the relevant line is queryTree.bf:59, with a similar syntax to the output file:

if (!IS_TREE_PRESENT_IN_DATA)
{
    SetDialogPrompt ("Please select a tree file for the data:");

    fscanf (PROMPT_FOR_FILE, REWIND, "Raw", treeString);
    /* As before, replace PROMPT_FOR_FILE
    with our phylogeny variable. In my case,
    JOE_HARDCODE_PHYLOGENY */

    fscanf (JOE_HARDCODE_PHYLOGENY, REWIND, "Raw", treeString);

Because this is an external function library, we need to find where it’s imported in BranchSiteREL.bf and make sure our hacked copy is loaded instead. The relevant line is BranchSiteREL.bf:44:

LoadFunctionLibrary ("queryTree");
/* Replace with our queryTree_hardcode.bf
(the *.bf suffix isn't needed) */

LoadFunctionLibrary ("queryTree_hardcode");

The genetic code translation definitions
The genetic code translation type is also handled in an external library, chooseGeneticCode.def – but annoyingly this isn’t in TemplateBatchFiles; it lives in the TemplateBatchFiles/TemplateModels subdirectory. Such is life… Again, I worked on a copy, chooseGeneticCode_HardcodeUniversal.def, and after modifying the library itself we need to edit the library call so that our hacked version is pulled in. First, the edit, which uses a slightly different but still intuitive syntax, found at chooseGeneticCode.def:95:

if (!skipCodeSelectionStep)
{
    /* this is where the user input routine ChoiceList() is called... */
    ChoiceList (modelType,"Choose Genetic Code",1,SKIP_NONE,_geneticCodeOptionMatrix);

    if (modelType < 0)
    {
        return;
    }
    /* but this is where the variable is actually set... */
    ApplyGeneticCodeTable (modelType);
    /* ... so we'll replace modelType with our global JOE_HARDCODE_GENETIC_CODE variable */
    ApplyGeneticCodeTable (JOE_HARDCODE_GENETIC_CODE);
}

The corresponding call to TemplateModels.chooseGeneticCode.def in BranchSiteREL.bf is right back at line 2:

skipCodeSelectionStep = 0;
LoadFunctionLibrary("chooseGeneticCode");
/* Replace the default library with our hacked one -
Note that the subdirectory path isn't needed; the TemplateModels subdirectory is searched by default */

LoadFunctionLibrary("chooseGeneticCode_HardcodeUniversal");


2. Edit batch file to read variables from STDIN

Phew! The good news is that was the fiddly bit; the rest of this is all easy. The next step is to replace the hardcoded variable initialisations at the head of our BranchSiteREL.bf copy with fscanf() calls that will assign values to these variables from the standard input (command line). So we’ll comment out:

/* Variables we'll define and later set by STDIN */
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_GENETIC_CODE = 1;
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";
/* Start of normal batch file */
skipCodeSelectionStep = 0;
...

And replace them with:

/* Variables we'll define and later set by STDIN */
/* comment out the hardcoded definitions ...
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_GENETIC_CODE = 1;
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";

And replace with stdin read via fscanf(): */

fscanf(stdin,"String",JOE_HARDCODE_ALIGNMENT);
fscanf(stdin,"String",JOE_HARDCODE_PHYLOGENY);
fscanf(stdin,"String",JOE_HARDCODE_OUTPUT);
JOE_HARDCODE_GENETIC_CODE = 1; // OK, we'll keep this one hardcoded for now
/* Start of normal batch file */
skipCodeSelectionStep = 0;
...

These are pretty self-explanatory. Done!

3. Call HYPHY in command-line mode, passing variables in-place as a ‘here’ string

At this point, all we’ve really done is refactor the batch file: we’ve moved where the variables are initialised / set, so that we can find them easily, and we’ve called fscanf() on each of them in order to set them. So far, because this implies someone, somehow, will need to type stuff into stdin at a prompt, it doesn’t actually solve our main problem – how to pass variables on the command line to HYPHY – but it has made everything a lot neater. Note that these are still three separate calls to fscanf(), however – which means HYPHY will expect three discrete chunks of user interaction. In a nutshell, if we ran HYPHY now, we’d get something like:

>HYPHY: Please choose a data file:
me: /foo/bar/hyphy_input.fa

>HYPHY: Please select a tree:
me: /foo/bar/hyphy_input.tre

>HYPHY: Please choose a file for output:
me: /foo/bar/hyphy_output.REL

So we need to get bash to accept input from a file or the command line, and pass it on to HYPHY each time HYPHY wants input. The easy way to do this is to put each user response on a separate line in a file (just the responses, nothing else), and use the ‘<‘ operator to redirect the standard input stream to this file instead of the keyboard. This might look a bit like:

# in: commands.sh — each response on its own line, in the order HYPHY
# asks for them (alignment, tree, output prefix)
hyphy-input.fasta
hyphy-input.tre
hyphy-output.REL

# HYPHYMP (the binary) could then be called with:
$user~: HYPHYMP BranchSiteREL_joeHack.bf < commands.sh

But that wouldn’t really help us, would it? We’d have to edit commands.sh separately for each locus! Luckily there is a handy Bash trick which I had to search for a bit – the ‘here’ string (I found this on LinuxJournal). This lets us redirect a string in-place to a command’s standard input, and takes the form:

$user~: command <<<'input_string_to_stdin'

Remembering that we had three fscanf() calls, one for each of our refactored variables, we’ll need three inputs. No problem (StackExchange to the rescue) – we can separate the inputs with newline (‘\n’) characters (we’ll also need the ‘$’ operator, to make sure bash interprets the newlines correctly), like this:

$user~: command <<<$'input_1\ninput_2\ninput_3'

This syntax is equivalent to giving command three separate and sequential inputs on its standard input.
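You can see this in action with cat, which simply echoes its standard input back (this little demo is mine, not from the original write-up):

$user~: cat <<<$'input_1\ninput_2\ninput_3'
input_1
input_2
input_3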

Putting it all together

Finally we’ve got everything we need to run HYPHY in command-line mode. To recap:

  • A command-line friendly version of HYPHY (see this post);
  • The edited versions of BranchSiteREL.bf, chooseGeneticCode.def and queryTree.bf, renamed and in place alongside their original copies;
  • Input alignment and tree files, and a writeable output directory;
  • A means (the ‘here’ or ‘<<<‘ operator) of sending multiple-line inputs to the standard input stream.

Running HYPHY on the command line with arguments passed

Let’s do this! There are a couple of additional options (CPU=integer, which sets the number of cores, and BASEPATH=/path/to/batchfiles, which ensures the right batchfile directory is being used) but don’t worry about those for now.

The complete command is:

/usr/local/bin/HYPHYMP CPU=number_of_cpu_cores BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ BranchSiteREL_joeHack.bf <<<$'/path/to/hyphy_input.fa\n/path/to/hyphy_input.tre\n/path/to/hyphy_output.REL'

You can optionally use things like >log.out and 2>log.err to redirect STDOUT and STDERR if you want, and & to fork and leave the job running, etc. But the critical bit of this command is the last part, after the ‘<<<‘ operator. I’ve only tested this using absolute/full pathnames for the input/output file arguments – it’s a pain, but less likely to break in the short term (what happens if you move the whole project folder is another matter…)

I admit this looks absolutely horrible. But it’s the best I can do.

In practice

So for me (user=jparker), working from ~/Downloads/hyphy_hacks/hackinput with alignment hyphy-input.fa and tree hyphy-input.tre, and outputting to files with prefix run2, the complete command is:

/usr/local/bin/HYPHYMP CPU=2 BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ BranchSiteREL_joeHack.bf <<<$'/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.fa\n/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.tre\n/home/jparker/Downloads/hyphy_hacks/hackinput/run2'

And if I don’t want to wait for it to complete, and send stdout and stderr to some files, the command is:

/usr/local/bin/HYPHYMP CPU=2 BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ BranchSiteREL_joeHack.bf <<<$'/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.fa\n/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.tre\n/home/jparker/Downloads/hyphy_hacks/hackinput/run4' >run4.stdout 2>run4.err &

Lastly, you can change the argument to the CPU= option if you want to. Be aware that by default HYPHYMP uses as many cores as it can see (I think)…
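And with that, the loop over loci which motivated all this becomes a few lines of bash. This is a minimal sketch under my own assumptions (a loci/ directory with one subdirectory per locus, each containing hyphy-input.fa and hyphy-input.tre); printf builds the same newline-separated string we have been passing with <<<$'...':

# run the hacked batch file over every locus directory in turn
for dir in /home/jparker/Downloads/loci/*/ ; do
    /usr/local/bin/HYPHYMP CPU=2 BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ BranchSiteREL_joeHack.bf \
        <<<"$(printf '%s\n' "${dir}hyphy-input.fa" "${dir}hyphy-input.tre" "${dir}run1")" \
        >"${dir}run1.stdout" 2>"${dir}run1.err"
done

Each iteration could just as easily be forked with & or handed to a cluster scheduler – which is the whole point of the exercise.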

Reviewing FastCodeML, a potential replacement for PAML’s codeml

(first posted at SBCS-Evolve)

Valle and colleagues have just released FastCodeML, an application for the detection of positive selection in DNA datasets. This is a successor to SlimCodeML, which is itself an improved implementation of the codeml program from Ziheng Yang’s workhorse phylogenetic analysis package PAML – a package which continues to be extremely popular, has moved through several versions, and has at long last acquired a GUI. Note ‘analysis’ – these programs are not intended for phylogeny inference (finding phylogenetic trees from data), although they will do a rough search if required. In fact, for many people the main program of choice for phylogeny inference itself is RAxML, which is produced by the same lab (Stamatakis). So – given this lab’s pedigree, I was interested to see how FastCodeML compares with codeml, the implementation the authors seek to replace.

Installation

To install FastCodeML you’ll find a set of tar’ed distributions at their FTP site, ftp://ftp.vital-it.ch/tools/FastCodeML/. It wasn’t immediately clear which is the most recent build, so I downloaded both a June 2013 version and the 1.01 release just in case. The distribution includes a precompiled binary for Linux (we’re on CentOS for our dev cluster), and as I was feeling lazy (sorry: keen to test the performance of the un-optimised binary that ships with the package), I used this. However, a simple configure / make / make check / make install will do the job as well. I had to fiddle with a couple of permissions, but that’s all stuff unique to our dev cluster and common to everything we install, so it’s hardly fair to penalise FastCodeML for that…

Interface

FastCodeML, like RAxML, takes arguments passed on the command line at runtime, rather than using a GUI (like, say, HYPHY or Garli) or a configuration script (like PAML or BEAST). As a result (assuming you’ve added the FastCodeML executable, fast, to your $PATH) you can paste and go, e.g.:

fast -m 22 -bf -hy 1 ./<tree-file> <sequence-file>

I’m ambivalent about passing arguments on the command line for phylogenetic analyses. Let’s look at the pros and cons. A GUI is more attractive to many users – but only with the important caveat that it is well designed; otherwise usability gets traded off against performance and flexibility. In any case, a lot of phylogeneticists are used to working without GUIs, and (more importantly) are crunching large genomic datasets these days, so the lack of a GUI isn’t really a problem.

What about config files, the approach taken by the original PAML package? Well, they are fiddly at times, but there are a number of advantages. Firstly, the config file itself is a useful aide-memoire, and default parameter values can be suggested. Secondly, and perhaps most importantly in a phylogenomic context, the config file acts as a record of the analysis settings. It can be easier to debug pipelines with config files as well. However, a drawback of config files is their size, and in a pipeline context relying on another file existing where and when you need it can be risky. Passing arguments on the other hand avoids this, and also allows you to work more efficiently with pipes.

About the most significant drawback of the pass-by-argument approach is that, as the number of parameters/options increases, keeping track of your own intentions becomes much harder. For example, running a BEAST analysis using arguments would be virtually impossible. In our pipeline API I coded a wrapper for RAxML where the various options could be laid out clearly in the comments, and various operation modes defined as enumerated types; but despite years of using RAxML I still forget certain command-line options from time to time if running single analyses. Overall, though, for simple applications like RAxML, MUSCLE or FastCodeML, I think I prefer command-line arguments as simpler and cleaner. Hopefully feature creep won’t bloat the option set of future releases.
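To show the sort of wrapper I mean, here’s a bash sketch for fast. The flags are copied verbatim from the example invocation above (I make no claim about what each one means); the function name and layout are my own:

# self-documenting wrapper: each option sits on its own commented line
run_fast() {
    local tree=$1    # Newick tree file
    local seqs=$2    # codon sequence alignment
    local opts=(
        -m 22        # flags exactly as in the example invocation above;
        -bf          # being able to annotate each one here is the point
        -hy 1
    )
    fast "${opts[@]}" "$tree" "$seqs"
}

# usage: run_fast my-tree.nwk my-seqs.phy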

Output

The output from FastCodeML is pleasingly clean and simple:

Doing branch 0

LnL1: -2393.120011398291808 Function calls: 1259

  0.1982869  0.1727072  0.2111106  0.0741396  0.0443093  0.0160066  0.0102474  0.0050577
  0.0084463  0.0254361  0.1328334  0.0754612

p0:  0.4817193  p1:  0.5182807  p2a:  0.0000000  p2b:  0.0000000
w0:  0.0000010  k:   5.1561766  w2:   1.1258227

This contrasts with several screens’ worth of ramblings from PAML, on both stdout and output streams (even with verbose disabled). To get the same data from PAML you have to dig about a bit in the stdout or results file, or else look at rst and find:

dN/dS (w) for site classes (K=3)

site class             0         1         2
proportion        0.55182   0.32889   0.11929
branch type 0:    0.05271   1.00000   3.43464
branch type 1:    0.05271   1.00000   0.00000

The two methods are broadly in agreement, to a first approximation: both the proportions and omegas (dN/dS) estimated are fairly similar. I wasn’t too worried about the numerical value of the estimates here, as there are subtle differences in the implementation of both.

But what about their speed? This is the major selling-point of FastCodeML, after all; there are many competing methods to detect natural selection, but the pedigree of Stamatakis and colleagues promises a substantial improvement over PAML. There are two considerations here: wall-clock time, the time the calculation appears to take to the user (most relevant to single local analyses); and CPU time, which excludes I/O and time the thread spends sleeping or suspended – for cluster work the latter is far more important, since a majority of jobs will be scheduled and CPU time rationed.
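Both numbers are easy to collect with the shell’s time keyword; a sketch, reusing the placeholder input names from the fast example above (codeml, for its part, is driven by its codeml.ctl control file):

$user~: time codeml codeml.ctl
$user~: time fast -m 22 -bf -hy 1 ./<tree-file> <sequence-file>

In time’s output, ‘real’ is the wall-clock figure, while ‘user’ is CPU time summed across all threads.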

Well, according to the POSIX time utility, PAML does fairly well:

real	0m50.872s
user	0m50.781s
sys	0m00.006s

But what about FastCodeML?

real	0m05.641s
user	1m22.698s
sys	0m01.158s

Pow. The important thing to note here is the wall-clock (real) time: FastCodeML finishes around nine times faster than codeml. Its total CPU (user) time is actually higher, because FastCodeML spreads the calculation across several cores at once – the dramatic gain is in elapsed time rather than in CPU cycles consumed. This is extremely promising and suggests that pipeline operations, where turnaround time is what matters, should be significantly faster using FastCodeML…


Conclusions

If I’d been bothered enough to use some sort of boxing analogy throughout this post, I’d be using metaphors like ‘sucker punch’ and ‘knockout blow’ by now. FastCodeML really is fast, and impressively so. It also offers a reduced filesystem footprint and cleaner I/O, which will likely make it ideal for cluster/pipeline work. Assuming the accuracy and specificity of its implementation are at least comparable to PAML (but that’s a much harder thing to test) I’m likely to use FastCodeML for most analyses of natural selection in future.

Using x-windows to render GUIs for server-client phylogenomics

The power of modern computers allied to high-throughput next-generation sequencing will usher in a new era of quantitative biology that will deliver the moon on a stick to every researcher on Earth…

Well, quite.

But something we’ve run up against recently in the lab is actually performing useful analyses on our phylogenomic datasets after the main evolutionary analyses. That is, once you’ve assembled, aligned, built phylogenies and measured interesting things like divergence rates and dN/dS – how do you display and explore the collected set of loci?

I’ll give you an example – suppose we have a question like “what is the correlation between tree length and mean dS rate in our 1,000 loci?” It’s fairly simple to spit the relevant statistics out into text files and then cat them together for analysis in R (or Matlab, but please, for the love of God, not Excel…). And so far, this approach has got a lot of work done for us.
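For the record, that collation step really is trivial – a sketch, assuming (my layout, not necessarily our pipeline’s) one tab-separated stats file per locus directory, with columns locus, tree_length and mean_dS:

# glue the per-locus tables into one file for R
cat loci/*/stats.tsv > all_loci.tsv

A read.table("all_loci.tsv") and cor.test() in R then answer the question as posed.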


But suppose we wanted to change the question slightly – maybe we notice some pattern when eyeballing a few of the most interesting alignments – and instead ask something like “what is the correlation between tree length and mean dS rate in our 1,000 loci, where polar residues are involved with a sitewise log-likelihood of ≤ 2 units?” Or ≤ 3? Or ≤ 10? Most of this information is already available at runtime, parsed or calculated. We could tinker with our collation script again, re-run it, and do another round of R analyses; but this is wasteful on a variety of levels. It would be much more useful to interact with the data on the server while it’s in memory. We’ve been starting to look at ways of doing this, and that’s for another post. But for now there’s another question to solve before that becomes useful, namely: how do you execute a GUI on a server but run it on a client?

There are a variety of bioinformatics applications (Geneious; Galaxy; CLCBio) with client-server functionality built in, but in each case we’d have to wrap our analysis in a plugin layer or module in order to use it. Since we’ve already got the analysis in this case, that seems unnecessary, so instead we’re trialling X-windows. This has nothing to do with Microsoft Windows: it’s just a display protocol which piggybacks on an SSH connection to render a UNIX display device (e.g. XQuartz on a Mac) remotely. It sounds simple, and it is!

To set up an x-windowing session, we just have to first set up the connection to the bioinformatics server:

ssh -Y user@host.ac.uk
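(A quick sanity check of my own, not from the original post: once logged in, the forwarded display should show up in the session’s environment, as something like:)

$user~: echo $DISPLAY
localhost:10.0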

Now we have a secure x-window connection, our server-based GUI will display on the client if we simply call it (MDK.jar):

User:~ /usr/bin/java -jar ~/Documents/dev-builds/mdk/MDK.jar

Rather pleasingly simple, yes? Now, there are a few issues with this, not least a noticeable cursor lag. But – remembering that the aim here is simply to interact with analysis data held in memory at runtime, rather than do any complicated editing – we can probably live with that, for now at least. On to the next task!

(also published at evolve.sbcs)

Befi-BaTS v0.1.1 alpha release

Long-overdue update for beta version of Befi-BaTS.

Software: Befi-BaTS

Author: Joe Parker

Version: 0.1.1 beta (download here)

Release notes: Befi-BaTS v0.1 beta drops support for hard polytomies (tree nodes with more than two daughters), now throwing a HardPolytomyException to the error stack when these are parsed. This is because of potential bugs when dealing with topology-and-distance measures (NTI/NRI) on polytomies. These bugs will be fixed in a future release. The current version, 0.1.1, also improves #NEXUS input file parsing.

Befi-BaTS: Befi-BaTS uses two established statistics (the Association Index, AI (Wang et al., 2001), and the Fitch parsimony score, PS) as well as a third statistic (maximum exclusive single-state clade size, MC) introduced by us in the original BaTS paper, where the merits of each of these are discussed. Befi-BaTS 0.1.1 includes additional statistics that incorporate branch lengths as well as tree topology. What sets Befi-BaTS apart from previous methods, however, is that we incorporate uncertainty arising from phylogenetic error into the analysis through a Bayesian framework. While many other methods obtain a null distribution for significance testing through tip-character randomization, they rely on a single tree upon which phylogeny-trait association is measured for any observed or expected set of tip characters.

Generation of neutralizing antibodies and divergence of SIVmac239 in cynomolgus macaques following short-term early antiretroviral therapy.

PLoS Pathog. 2010 Sep 2;6(9):e1001084.
Ozkaya Sahin G, Bowles EJ, Parker J, Uchtenhagen H, Sheik-Khalil E, Taylor S, Pybus OG, Mäkitalo B, Walther-Jallow L, Spångberg M, Thorstensson R, Achour A, Fenyö EM, Stewart-Jones GB, Spetz AL.

Neutralizing antibodies (NAb) able to react to heterologous viruses are generated during natural HIV-1 infection in some individuals. Further knowledge is required in order to understand the factors contributing to induction of cross-reactive NAb responses. Here a well-established model of experimental pathogenic infection in cynomolgus macaques, which reproduces long-lasting HIV-1 infection, was used to study the NAb response as well as the viral evolution of the highly neutralization-resistant SIVmac239. Twelve animals were infected intravenously with SIVmac239. Antiretroviral therapy (ART) was initiated ten days post-inoculation and administered daily for four months. Viral load, CD4(+) T-cell counts, total IgG levels, and breadth as well as strength of NAb in plasma were compared simultaneously over 14 months. In addition, envs from plasma samples were sequenced at three time points in all animals in order to assess viral evolution. We report here that seven of the 12 animals controlled viremia to below 10(4) copies/ml of plasma after discontinuation of ART and that this control was associated with a low level of evolutionary divergence. Macaques that controlled viral load developed broader NAb responses early on. Furthermore, escape mutations, such as V67M and R751G, were identified in virus sequenced from all animals with uncontrolled viremia. Bayesian estimation of ancestral population genetic diversity (PGD) showed an increase in this value in non-controlling or transient-controlling animals during the first 5.5 months of infection, in contrast to virus-controlling animals. Similarly, non- or transient controllers displayed more positively-selected amino-acid substitutions. An early increase in PGD, resulting in the generation of positively-selected amino-acid substitutions, greater divergence and relative high viral load after ART withdrawal, may have contributed to the generation of potent NAb in several animals after SIVmac239 infection. However, early broad NAb responses correlated with relatively preserved CD4(+) T-cell numbers, low viral load and limited viral divergence.

Full-Length Characterization of Hepatitis C Virus Subtype 3a Reveals Novel Hypervariable Regions under Positive Selection during Acute Infection

Humphreys I, Fleming V, Fabris P, Parker J, Schulenberg B, Brown A, Demetriou C, Gaudieri S, Pfafferott K, Lucas M, Collier J, Huang KH, Pybus OG, Klenerman P, Barnes E.

J Virol. 2009 Nov;83(22):11456-66. Epub 2009 Sep 9.

Hepatitis C virus subtype 3a is a highly prevalent and globally distributed strain that is often associated with infection via injection drug use. This subtype exhibits particular phenotypic characteristics. In spite of this, detailed genetic analysis of this subtype has rarely been performed. We performed full-length viral sequence analysis in 18 patients with chronic HCV subtype 3a infection and assessed genomic viral variability in comparison to other HCV subtypes. Two novel regions of intragenotypic hypervariability within the envelope protein E2, of HCV genotype 3a, were identified. We named these regions HVR495 and HVR575. They consisted of flanking conserved hydrophobic amino acids and central variable residues. A 5-amino-acid insertion found only in genotype 3a and a putative glycosylation site is contained within HVR575. Evolutionary analysis of E2 showed that positively selected sites within genotype 3a infection were largely restricted to HVR1, HVR495, and HVR575. Further analysis of clonal viral populations within single hosts showed that viral variation within HVR495 and HVR575 were subject to intrahost positive selecting forces. Longitudinal analysis of four patients with acute HCV subtype 3a infection sampled at multiple time points showed that positively selected mutations within HVR495 and HVR575 arose early during primary infection. HVR495 and HVR575 were not present in HCV subtypes 1a, 1b, 2a, or 6a. Some variability that was not subject to positive selection was present in subtype 4a HVR575. Further defining the functional significance of these regions may have important implications for genotype 3a E2 virus-receptor interactions and for vaccine studies that aim to induce cross-reactive anti-E2 antibodies.

Journal Impact Factors – a good free tool

(originally posted on Kitserve.org.uk)

Recently, I’ve taken on more consulting work outside my own immediate area. http://www.eigenfactor.org, a free impact factor tool, has been incredibly handy. Here’s why.

Getting to grips (well, on some level at least) with a new system is a bit exciting and not a little empowering, too – like the first time you really understood crystallisation as a kid (remember those copper sulphate crystals in the jar?)

The problem is that journals always fall into four categories in my book:

  1. Top level ones like Science, Nature, PLoS and PNAS,
  2. Reviews and stuff that are usually a good place to start,
  3. Key articles in specialist journals, and
  4. Crapola which you don’t need to bother with (to start with, at least).

The trouble is, while 1, 2 and 4 pretty much find themselves, working out which journals to look in for the specialist stuff when you start in a new field is pretty hard. For instance, the Journal of General Virology, the Journal of Virology, and Virology all deal, obviously, with viruses and their biology… but which is the most authoritative?

The Impact Factor

If you’ve trained as a scientist, you’re probably sagely muttering ‘impact factor’. If you’ve ever worked as one you’re probably screaming it.

So what is this ‘impact factor’? Sounds like something to do with ballistics. Basically it’s a measure of the amount of influence a given published scientific article has on other articles. Since an article’s authors reference (or ‘cite’) other articles from all kinds of journals and books for background information and to support their own assertions, it follows that an article considered to be important to professionals in a field will be cited more frequently than an irrelevant one.

So good articles are cited more frequently. That helps us find them (both http://pubmed.org and http://scholar.google.com will tell you how many times a given article has been cited). And it’s a fairly simple matter to aggregate the mean number of citations per article in a particular journal and express that as a ratio or percentile of others in its field. (You can also apply the same process to deciding whether or not to hire a particular scientist, if you’re an institution or funding body – a nasty and growing trend which explains the screaming mentioned above…)

Use your judgement

There is a whole set of complex arguments about the best way to do that, and I won’t go into them here – not least because, in my opinion, at the end of the day you should always use your own good professional judgement when evaluating an article’s importance; no impact factor can fully do that for you. Ask yourself:

  • Do I actually know enough about the area yet to work out in general what the hell this article means, let alone if it’s any good?
  • If the citations seem particularly high (or low) for this journal/authors/general quality of paper, am I missing something?

If the answer to the first question is ‘no’ you’d better go off and read a few more reviews…

Eigenfactor

Anyway, why am I boning up on eigenfactor.org so hard at the moment? Well, a few reasons really:

  1. It’s free
  2. It has good coverage, and
  3. A great search interface, which is simple to use.

There are a few other useful things about their interface and data filtering, but for me those are the three main reasons. The ‘free’ thing’s great, obviously. But I really like the coverage they have and search interface because it quickly lets me find my way into a subject – when you start typing a journal or discipline into the box it autocompletes for you really smoothly. Ace huh?

Applying this to our virology journals, we find that their impact factors differ quite widely:

  • J. Gen. Virol. – 3.092
  • J. Virol. – 5.332
  • Virology – 3.765

So the Journal of Virology is the winner! Cool. Now I’m off to bone up on vif gene inactivation…

The within- and among-host evolution of chronically-infecting human RNA viruses

A research thesis submitted for the degree of Doctor of Philosophy at the University of Oxford.

J Parker

Funded by: Natural Environment Research Council (UK) with support from Linacre College, Oxford.

Abstract: This thesis examines the evolutionary biology of the RNA viruses, a diverse group of pathogens that cause significant diseases. The focus of this work is the relationship between the processes driving the evolution of virus populations within individual hosts and at the epidemic level.

First, Chapter One reviews the basic biology of RNA viruses, the current state of knowledge in relevant topics of evolutionary virology, and the principles that underlie the most commonly used methods in this thesis.

In Chapter Two, I develop and test a novel framework to estimate the significance of phylogeny-trait association in viral phylogenies. The method incorporates phylogenetic uncertainty through the use of posterior sets of trees (PST) produced in Bayesian MCMC analyses.

In Chapter Three, I conduct a comprehensive analysis of the substitution rate of hepatitis C virus (HCV) in within- and between-host data sets using a relaxed molecular clock. I find that within-host substitution rates are more rapid than previously appreciated, that heterotachy is rife in within-host data sets, and that selection is likely to be a primary driver.

In Chapter Four I apply the techniques developed in Chapter Two to successfully detect compartmentalization between peripheral blood and cervical tissues in a large data set of human immunodeficiency virus (HIV) patients. I propose that compartmentalization in the cervix is maintained by selection.

In Chapter Five I extend the framework developed in Chapter Two and explore the Type II error of the statistics used.

In Chapter Six I review the findings of this thesis and conclude with a general discussion of the relationship between within- and among-host evolution in viruses, and some of the limitations of current techniques.

Estimating the Date of Origin of An HIV-1 Circulating Recombinant Form

Virology. 2009 Apr 25;387(1):229-34. Epub 2009 Mar 9.
Tee KK, Pybus OG, Parker J, Ng KP, Kamarulzaman A, Takebe Y.

HIV is capable of frequent genetic exchange through recombination. Despite the pandemic spread of HIV-1 recombinants, their times of origin are not well understood. We investigate the epidemic history of an HIV-1 circulating recombinant form (CRF) by estimating the time of the recombination event that led to the emergence of CRF33_01B, a recently described recombinant descended from CRF01_AE and subtype B. The gag, pol and env genes were analyzed using a combined coalescent and relaxed molecular clock model, implemented in a Bayesian Markov chain Monte Carlo framework. Using linked genealogical trees we calculated the time interval between the common ancestor of CRF33_01B and the ancestors it shares with closely related parental lineages. The recombination event that generated CRF33_01B (t(rec)) occurred sometime between 1991 and 1993, suggesting that recombination is common in the early evolutionary history of HIV-1. This proof-of-concept approach provides a new tool for the investigation of HIV molecular epidemiology and evolution.