Tag Archives: phylogeny

What is ‘real-time’ phylogenomics?

Over the past few years I’ve been developing research, which I collectively refer to as ‘real-time phylogenomics’ – and this is the name of our mini-site for MinION-based rapid identification-by-sequencing. Since our paper on this will hopefully be published soon, it’s probably worth defining what I hope this term denotes now, what it does not – and ultimately where I hope this research is going.

‘Phylogenomics’ is simple enough, and Jonathan Eisen at UC Davis has been a fantastic advocate of the concept. Essentially, phylogenomics is scaled-up molecular systematics, with datasets (usually derived from a genome annotation and/or transcriptome) comprising many coding loci rather than a few genes. ‘Many’ in this case usually means hundreds, or thousands, so we’re typically looking at primarily nuclear genes, although organelles’ genomes may often be incorporated, since they’re usually far easier to reliably assemble and annotate. The aim is, basically to average phylogenetic signal over many loci by combining gene trees (or an analogous approach) to try and obtain phylogenies with higher confidence (single- or few-locus approaches, including barcodes no matter how judiciously chosen, capable of producing incorrect trees with high confidence). The process is intensive, since genomes must be sequenced and then assembled to a sufficient standard to be reasonably certain of identifying orthologous loci. This isn’t the only use of the term (which also refers to phylogenies produced from whole-genome metagenomics) but the most straightforward and common one as far as eukaryote genomics is concerned, and certainly the one uppermost in my mind.

However the results are often confusing, or at least more complex than we might hope: instead of a single phylogeny with high support from all loci, and robust to the model used, we often find a high proportion of gene trees (10-30%, perhaps) agree with each other, but not the modal (most common, e.g. majority rule consensus) tree topology. For instance among 2, 326 loci in our 2013 paper on phylogenomics of the major bat families, we found that position of a particular group of echolocators – which had been hotly debated for decades, based on morphological and single-locus approaches – showed such a pattern (sometimes supporting the traditional grouping of Microchiroptera + Megachiroptera, but over 60% of loci supporting the newer Yangochiroptera + Yinpterochiroptera system. This can be for a variety of reasons, some biological and some methodological. The point is that we have a sufficiently detailed picture to let us chose between competing phylogenetic hypothesis with both statistical confidence and intuition based on comparison.

These techniques have been on the horizon for a while (certainly since at least 2000) and gathered pace over the last decade with improvements in computing, informatics, and especially next-generation sequencing. The other half of this equation, ‘real-time’ sequencing, has emerged much more recently and centres, obviously, on the MinION sequencer. Most work using this so far has focused either on the very impressive potential long-read data offers for genomic analyses, particularly assembly, or rapid ID of samples e.g. the Quick/Loman Zika and Ebola monitoring studies; and our own work.

So what, exactly, do we hope to achieve with phylogenomic-type analyses using real-time MinION data, and why?

Well, firstly, our work so far has shown that the existing pipeline (sample -> transport -> sequence-> assemble genome-> annotate genes-> align loci-> build trees) has lots of room for speedups, and we’re fairly confident that the inevitable tradeoff with accuracy when you omit or simplify certain steps (laboratory-standard sequencing, assembly) is at least compensated for by the volume of data alone. Recall that a ‘normal’ phylogenomic tree similar to our bat one might take two or more postdocs/students a year to generate from biological samples, often longer. A process taking a week instead would let you generate something like 50 more analyses in a year! The most obvious application for this is just accelerating existing research, but the potential for transforming fieldwork and citizen science is considerable. This is because you can build trees that inform species relationships, even if the species in question isn’t known. In other words a phylogenome can both reliably identify an unknown sample, and also identify if it is a new species.

More excitingly, I think we need to have a deeper look at how we both construct and analyse evolutionary models. Life on earth can be accurately and fully described best by a network, not a bifurcating tree, but this applies to loci as well as single genes. In other words, there is a single network that connects every locus in every living thing. Phylogenetic trees are only a bifurcating projection of this, while single- or multi-locus networks only comprise a part.

We’ve hitherto ignored this fact, largely because (a) trees are often a good approximation, especially in the case of eukaryote nuclear genes, and (b) the data and computation requirements a ‘network-of-life’ analysis implies are formidable. However, cracks are beginning to appear, in both faces. Firstly, many loci are subject to real biological phenomena (horizontal gene transfer, selection leading to adaptive convergence, etc) which give erroneous trees as discussed above. Meanwhile prokaryotic and viral inference is rarely even this straightforward. Secondly, expanding computing power, algorithmic complexity, and sequencing capacity (imagine just 1,000 high schools across the world, regularly using a MinION for class projects…) mean the question for us today really isn’t ‘how do we get data’, but ‘how ambitious do we want to be with it?’

Since my PhD but especially since 2012, I’ve been developing this theme. Ultimately I think the answer lies in the continuous analysis of public phylogenomic data. Imagine multiple distributed but linked analyses, continuously running to infer parts of the network of life, updating their model asynchronously both as new data flood in, and by exchanging information with each other. This is really what we mean by real-time phylogenomics – nothing less than a complete Network of Life, living in the cloud, publicly available and collaboratively and continuously inferred from real-time sequence data.

So… that’s what I plan to spend the 2020s doing, anyway.

 

Befi-BaTS v0.1.1 alpha release

Long-overdue update for beta version of Befi-BaTS.

Software: Befi-BaTS

Author: Joe Parker

Version: 0.1.1 beta (download here)

Release notes: Befi-BaTS v0.1 beta drops support for hard polytomies (tree nodes with > 2 daughters), now throwing a HardPolytomyException to the error stack when these are parsed. This is because of potential bugs when dealing with topology + distance measures (NTI/NRI) of polytomies. These bugs will be fixed in a future release. The current version 0.1.1 improves #NEXUS input file parsing.

Befi-BaTS: Befi-BaTS uses two established statistics (the Association Index, AI (Wang et al., 2001), and Fitch parsimony score, PS) as well as a third statistic (maximum exclusive single-state clade size, MC) introduced by us in the BaTS citation, where the merits of each of these are discussed. Befi-BaTS 0.1.1 includes additional statistics that include branch length as well as tree topology. What sets Befi-BaTS aside from previous methods, however, is that we incorporate uncertainty arising from phylogenetic error into the analysis through a Bayesian framework. While other many other methods obtain a null distribution for significance testing through tip character randomization, they rely on a single tree upon which phylogeny-trait association is measured for any observed or expected set of tip characters.

Generation of neutralizing antibodies and divergence of SIVmac239 in cynomolgus macaques following short-term early antiretroviral therapy.

PLoS Pathog. 2010 Sep 2;6(9):e1001084.
Ozkaya Sahin G, Bowles EJ, Parker J, Uchtenhagen H, Sheik-Khalil E, Taylor S, Pybus OG, Mäkitalo B, Walther-Jallow L, Spångberg M, Thorstensson R, Achour A, Fenyö EM, Stewart-Jones GB, Spetz AL.

Neutralizing antibodies (NAb) able to react to heterologous viruses are generated during natural HIV-1 infection in some individuals. Further knowledge is required in order to understand the factors contributing to induction of cross-reactive NAb responses. Here a well-established model of experimental pathogenic infection in cynomolgus macaques, which reproduces long-lasting HIV-1 infection, was used to study the NAb response as well as the viral evolution of the highly neutralization-resistant SIVmac239. Twelve animals were infected intravenously with SIVmac239. Antiretroviral therapy (ART) was initiated ten days post-inoculation and administered daily for four months. Viral load, CD4(+) T-cell counts, total IgG levels, and breadth as well as strength of NAb in plasma were compared simultaneously over 14 months. In addition, envs from plasma samples were sequenced at three time points in all animals in order to assess viral evolution. We report here that seven of the 12 animals controlled viremia to below 10(4) copies/ml of plasma after discontinuation of ART and that this control was associated with a low level of evolutionary divergence. Macaques that controlled viral load developed broader NAb responses early on. Furthermore, escape mutations, such as V67M and R751G, were identified in virus sequenced from all animals with uncontrolled viremia. Bayesian estimation of ancestral population genetic diversity (PGD) showed an increase in this value in non-controlling or transient-controlling animals during the first 5.5 months of infection, in contrast to virus-controlling animals. Similarly, non- or transient controllers displayed more positively-selected amino-acid substitutions. An early increase in PGD, resulting in the generation of positively-selected amino-acid substitutions, greater divergence and relative high viral load after ART withdrawal, may have contributed to the generation of potent NAb in several animals after SIVmac239 infection. However, early broad NAb responses correlated with relatively preserved CD4(+) T-cell numbers, low viral load and limited viral divergence.

Full-Length Characterization of Hepatitis C Virus Subtype 3a Reveals Novel Hypervariable Regions under Positive Selection during Acute Infection

Humphreys I, Fleming V, Fabris P, Parker J, Schulenberg B, Brown A, Demetriou C, Gaudieri S, Pfafferott K, Lucas M, Collier J, Huang KH, Pybus OG, Klenerman P, Barnes E.

J Virol. 2009 Nov;83(22):11456-66. Epub 2009 Sep 9.

Hepatitis C virus subtype 3a is a highly prevalent and globally distributed strain that is often associated with infection via injection drug use. This subtype exhibits particular phenotypic characteristics. In spite of this, detailed genetic analysis of this subtype has rarely been performed. We performed full-length viral sequence analysis in 18 patients with chronic HCV subtype 3a infection and assessed genomic viral variability in comparison to other HCV subtypes. Two novel regions of intragenotypic hypervariability within the envelope protein E2, of HCV genotype 3a, were identified. We named these regions HVR495 and HVR575. They consisted of flanking conserved hydrophobic amino acids and central variable residues. A 5-amino-acid insertion found only in genotype 3a and a putative glycosylation site is contained within HVR575. Evolutionary analysis of E2 showed that positively selected sites within genotype 3a infection were largely restricted to HVR1, HVR495, and HVR575. Further analysis of clonal viral populations within single hosts showed that viral variation within HVR495 and HVR575 were subject to intrahost positive selecting forces. Longitudinal analysis of four patients with acute HCV subtype 3a infection sampled at multiple time points showed that positively selected mutations within HVR495 and HVR575 arose early during primary infection. HVR495 and HVR575 were not present in HCV subtypes 1a, 1b, 2a, or 6a. Some variability that was not subject to positive selection was present in subtype 4a HVR575. Further defining the functional significance of these regions may have important implications for genotype 3a E2 virus-receptor interactions and for vaccine studies that aim to induce cross-reactive anti-E2 antibodies.

Estimating the Date of Origin of An HIV-1 Circulating Recombinant Form

Virology. 2009 Apr 25;387(1):229-34. Epub 2009 Mar 9.
Tee KK, Pybus OG, Parker J, Ng KP, Kamarulzaman A, Takebe Y.

HIV is capable of frequent genetic exchange through recombination. Despite the pandemic spread of HIV-1 recombinants, their times of origin are not well understood. We investigate the epidemic history of a HIV-1 circulating recombinant form (CRF) by estimating the time of the recombination event that lead to the emergence of CRF33_01B, a recently described recombinant descended from CRF01_AE and subtype B. The gag, pol and env genes were analyzed using a combined coalescent and relaxed molecular clock model, implemented in a Bayesian Markov chain Monte Carlo framework. Using linked genealogical trees we calculated the time interval between the common ancestor of CRF33_01B and the ancestors it shares with closely related parental lineages. The recombination event that generated CRF33_01B (t(rec)) occurred sometime between 1991 and 1993, suggesting that recombination is common in the early evolutionary history of HIV-1. The proof-of-concept approach provides a new tool for the investigation of HIV molecular epidemiology and evolution.