
Rapid, raw-read reference and identification (R4IDs): A flexible platform for rapid generic species ID using long-read sequencing technology.

Our paper on rapid identification of samples using partial, low-coverage, MinION-sequenced reference databases (demonstrated at the Kew Science Festival) is now out as a preprint. See here on bioRxiv: doi: 10.1101/281048.

In it, we show (with empirical data and simulation) that the length and bias of MinION reads make them ideal for sample ID – better than NGS, under certain conditions – even where no reference assembly is available and a genomic skim is used as the BLAST database instead. Because these genome-skim databases are quick to generate, we call them ‘rapid-raw-read-reference for ID’, or ‘R4IDs’ for short.

The code to repeat these analyses or set up an R4IDs analysis is on GitHub, but I’ve also packaged this as Docker containers.

A few rough corners to sand while we decide where to submit it, but comments are welcome in the meantime. Thanks, as ever, to my colleagues (long-suffering Alex and Andrew) for all their stress:

Rapid, raw-read reference and identification (R4IDs): A flexible platform for rapid generic species ID using long-read sequencing technology.

Joe Parker*, Andrew J. Helmstetter, Alexander S. T. Papadopulos*


The versatility of the current DNA sequencing platforms and the development of portable, nanopore sequencers means that it has never been easier to collect genetic data for unknown sample ID. DNA barcoding and meta-barcoding have become increasingly popular and barcode databases continue to grow at an impressive rate. However, the number of canonical genome assemblies (reference or draft) that are publicly available is relatively tiny, hindering the more widespread use of genome scale DNA sequencing technology for accurate species identification and discovery. Here, we show that rapid raw-read reference datasets, or R4IDs for short, generated in a matter of hours on the Oxford Nanopore MinION, can bridge this gap and accelerate the generation of useable reference sequence data. By exploiting the long read length of this technology, shotgun genomic sequencing of a small portion of an organism’s genome can act as a suitable reference database despite the low sequencing coverage. These R4IDs can then be used for accurate species identification with minimal amounts of re-sequencing effort (<1000s of reads). We demonstrated the capabilities of this approach with six vascular plant species for which we created R4IDs in the laboratory and then re-sequenced, live at the Kew Science Festival 2016. We further validated our method using simulations to determine the broader applicability of the approach. Our data analysis pipeline has been made available as a Dockerised workflow for simple, scalable deployment for a range of uses.
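As a rough illustration of the identification step described above, the sketch below assigns each query read to whichever reference database gives it the best BLAST hit, then reports the majority call for the sample. This is a minimal sketch under stated assumptions, not the pipeline from the paper: the input format (one dict of best bit scores per read), the function names and the `min_margin` threshold are all invented for the example.

```python
from collections import Counter


def assign_read(scores, min_margin=0.0):
    """Return the database with the best bit score, or None if ambiguous.

    scores: dict mapping database name -> best bit score for one read
            (hypothetical input; databases with no hit are simply absent).
    min_margin: gap the best score must exceed over the runner-up
                (an assumed tuning knob, not a value from the paper).
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= min_margin:
        return None  # best and second-best hits are too close to call
    return ranked[0][0]


def identify_sample(per_read_scores, min_margin=0.0):
    """Majority vote over all reads that could be assigned unambiguously."""
    calls = (assign_read(s, min_margin) for s in per_read_scores)
    votes = Counter(c for c in calls if c is not None)
    if not votes:
        return None, votes
    return votes.most_common(1)[0][0], votes


# Hypothetical bit scores for four reads against two reference databases.
reads = [
    {"species_A": 250.0, "species_B": 80.0},
    {"species_A": 120.0},
    {"species_A": 90.0, "species_B": 90.0},  # tie: left unassigned
    {},                                      # no hits at all
]
best, votes = identify_sample(reads)
print(best, dict(votes))
```

Leaving tied or hit-less reads unassigned, rather than forcing a call, is a design choice for the sketch; with only a few hundred reads per sample it keeps ambiguous hits from diluting the vote.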

Our Nature paper! Genome-wide molecular convergence in echolocating mammals

Exciting news from the lab this week… we’ve published in one of the leading journals, Nature!!!

Much of my work in the Rossiter BatLab for the last couple of years has centred around the search for genomic signatures of molecular convergence. This means looking for similar genetic changes in otherwise unrelated organisms. We’d normally expect unrelated organisms to differ considerably in their genetic sequences, because over time random mutations occur in their genomes; the more time has passed since two species diverged, the more changes we expect. However, we know that similar structures may evolve in unrelated species due to shared selection pressures (think of the streamlined body shapes of sharks, ichthyosaurs and dolphins, for example). Can these pressures produce identical changes right down at the level of genetic sequences? We hoped to detect identical genetic changes in unrelated species (in this case, the echolocation – ‘sonar hearing’ – in some species of bats and whales) caused by similar selection pressures operating on the evolution of the genes required for those traits.

It’s been a long slog – we’ve had to write a complicated computer program to look at millions of letters of DNA – but this week it all bears fruit. We found that a <em>staggering</em> number of genes in the genomes of echolocating bats and whales (a bottlenose dolphin, if you must) showed evidence of these similar genetic changes, known technically as ‘genetic convergence’.

Obviously we started jumping up and down when we found this, and because we imagined other scientists would too, we wrote up our findings and sent them to the journal <em>Nature</em>, one of the top journals in the world of science… and crossed our fingers.

Well, today we can finally reveal that we were able to get through the peer-review process (where anonymous experts scrutinise your working – a bit like an MOT for your experiments), and the paper is out today!

But what do we actually say? Well:
<blockquote>Evolution is typically thought to proceed through divergence of genes, proteins and ultimately phenotypes. However, similar traits might also evolve convergently in unrelated taxa owing to similar selection pressures. Adaptive phenotypic convergence is widespread in nature, and recent results from several genes have suggested that this phenomenon is powerful enough to also drive recurrent evolution at the sequence level. Where homoplasious substitutions do occur these have long been considered the result of neutral processes. However, recent studies have demonstrated that adaptive convergent sequence evolution can be detected in vertebrates using statistical methods that model parallel evolution, although the extent to which sequence convergence between genera occurs across genomes is unknown. Here we analyse genomic sequence data in mammals that have independently evolved echolocation and show that convergence is not a rare process restricted to several loci but is instead widespread, continuously distributed and commonly driven by natural selection acting on a small number of sites per locus. Systematic analyses of convergent sequence evolution in 805,053 amino acids within 2,326 orthologous coding gene sequences compared across 22 mammals (including four newly sequenced bat genomes) revealed signatures consistent with convergence in nearly 200 loci. Strong and significant support for convergence among bats and the bottlenose dolphin was seen in numerous genes linked to hearing or deafness, consistent with an involvement in echolocation. Unexpectedly, we also found convergence in many genes linked to vision: the convergent signal of many sensory genes was robustly correlated with the strength of natural selection. This first attempt to detect genome-wide convergent sequence evolution across divergent taxa reveals the phenomenon to be much more pervasive than previously recognized.</blockquote>
Congrats to Steve, Georgia and Joe! After a few deserved beers we’ll have our work cut out to pick through all these genes and work out exactly what all of them do (guessing the genes’ biological functions, especially in non-model (read: not us or things we eat) organisms like bats and dolphins, is notoriously tricky). So we’ll probably stick our heads out of the lab again in <em>another</em> two years…

The full citation is: Parker, J., Tsagkogeorga, G., Cotton, J.A.C., Liu, R., Stupka, E., Provero, P. &amp; Rossiter, S.J. (2013) Genome-wide signatures of convergent evolution in echolocating mammals. <em>Nature</em> (epub ahead of print), 4th September 2013. doi:10.1038/nature12511. This work was funded by Biotechnology and Biological Sciences Research Council (UK) grant BB/H017178/1.
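To make the idea of ‘identical genetic changes in unrelated species’ concrete, here is a toy sketch that scans an amino-acid alignment for columns where a set of focal taxa (say, the echolocators) all share one residue that no other taxon carries. It is only a naive counting illustration: the paper itself used statistical models of sequence evolution, not this shortcut, and the taxon names, sequences and function name below are all made up.

```python
def shared_unique_sites(alignment, focal):
    """Return 0-based columns where all focal taxa share a residue
    that is absent from every non-focal taxon.

    alignment: dict mapping taxon name -> aligned amino-acid string
    focal: iterable of taxon names hypothesised to have converged
    """
    focal = set(focal)
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to equal length")
    sites = []
    for col in range(lengths.pop()):
        focal_res = {alignment[t][col] for t in focal}
        other_res = {seq[col] for t, seq in alignment.items() if t not in focal}
        same_in_focal = len(focal_res) == 1 and "-" not in focal_res
        if same_in_focal and not (focal_res & other_res):
            sites.append(col)
    return sites


# Entirely hypothetical five-residue alignment.
toy = {
    "bat": "MKLTV",
    "dolphin": "MKLTV",
    "cow": "MRLSV",
    "mouse": "MRLSV",
}
print(shared_unique_sites(toy, ["bat", "dolphin"]))
```

A count like this cannot tell true convergence from chance or from shared ancestry, which is why a model-based test is needed in practice.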


The mode and tempo of hepatitis C virus evolution within and among hosts.

BMC Evol Biol. 2011 May 19;11(1):131. [Epub ahead of print]

Gray RR*, Parker J*, Lemey P, Salemi M, Katzourakis A, Pybus OG.

*These authors contributed equally to this article.


Hepatitis C virus (HCV) is a rapidly-evolving RNA virus that establishes chronic infections in humans. Despite the virus’ public health importance and a wealth of sequence data, basic aspects of HCV molecular evolution remain poorly understood. Here we investigate three sets of whole HCV genomes in order to directly compare the evolution of whole HCV genomes at different biological levels: within- and among-hosts. We use a powerful Bayesian inference framework that incorporates both among-lineage rate heterogeneity and phylogenetic uncertainty into estimates of evolutionary parameters.


Most of the HCV genome evolves at ~0.001 substitutions/site/year, a rate typical of RNA viruses. The antigenically-important E1/E2 genome region evolves particularly quickly, with correspondingly high rates of positive selection, as inferred using two related measures. Crucially, in this region an exceptionally higher rate was observed for within-host evolution compared to among-host evolution. Conversely, higher rates of evolution were seen among-hosts for functionally relevant parts of the NS5A gene. There was also evidence for slightly higher evolutionary rate for HCV subtype 1a compared to subtype 1b.
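To give a feel for what a rate of ~0.001 substitutions/site/year means in practice, the short sketch below converts it into expected changes per genome and into pairwise divergence between two lineages. The genome length of 9,600 nucleotides is an approximate figure assumed here, not a value from the abstract, and the calculation ignores multiple substitutions at the same site, so it only holds for short time spans.

```python
def expected_substitutions(rate, n_sites, years):
    """Expected number of substitutions along one lineage.

    rate: substitutions per site per year
    n_sites: number of sites considered (assumed genome length)
    years: elapsed time along the lineage
    """
    return rate * n_sites * years


def pairwise_divergence(rate, years_since_split):
    """Approximate per-site divergence between two lineages that split
    years_since_split ago; both branches accumulate changes, hence the 2."""
    return 2.0 * rate * years_since_split


RATE = 1e-3           # from the abstract: ~0.001 substitutions/site/year
GENOME_LENGTH = 9600  # assumption: approximate HCV genome length in nt

print(expected_substitutions(RATE, GENOME_LENGTH, years=1))
print(pairwise_divergence(RATE, years_since_split=10))
```

Under these assumptions a single lineage picks up roughly ten substitutions across the genome each year, and two lineages that split a decade ago differ at about 2% of sites.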


Using new statistical methods and comparable whole genome datasets we have quantified, for the first time, the variation in HCV evolutionary dynamics at different scales of organisation. This confirms that differences in molecular evolution between biological scales are not restricted to HIV and may represent a common feature of chronic RNA viral infection. We conclude that the elevated rate observed in the E1/E2 region during within-host evolution more likely results from the reversion of host-specific adaptations (resulting in slower long-term among-host evolution) than from the preferential transmission of slowly-evolving lineages.

Molecular epidemiology and phylogeny reveals complex spatial dynamics of endemic canine parvovirus.

J Virol. 2011 May 18. [Epub ahead of print]

Clegg SR, Coyne KP, Parker J, Dawson S, Godsall SA, Pinchbeck G, Cripps PJ, Gaskell RM, Radford AD.

Canine parvovirus 2 (CPV-2) is a severe enteric pathogen of dogs, causing high mortality in unvaccinated dogs. After emerging, CPV-2 spread rapidly worldwide. However, there is now some evidence to suggest that international transmission appears to be more restricted. In order to investigate the transmission and evolution of CPV-2 both nationally and in relation to the global situation, we have used a long range PCR to amplify and sequence the full VP2 gene of 150 canine parvoviruses obtained from a large cross-sectional sample of dogs presenting with severe diarrhoea to veterinarians in the UK, over a two year period. Amongst these 150 strains, 50 different DNA sequence types were identified, and apart from one case, all appeared unique to the UK. Phylogenetic analysis provided clear evidence for spatial clustering at the international level, and for the first time also at the national level, with the geographical range of some sequence types appearing to be highly restricted within the UK. Evolution of the VP2 gene in this dataset was associated with a lack of positive selection. In addition, the majority of predicted amino acid sequences were identical to those found elsewhere in the world, suggesting CPV VP2 has evolved a highly fit conformation. Based on typing systems using key amino acid mutations, 43% of viruses were CPV 2a, 57% CPV 2b, with no type 2 or 2c found. However phylogenetic analysis suggested complex antigenic evolution of this virus, with both type 2a and 2b viruses appearing polyphyletic. As such, typing based on specific amino acid mutations may not reflect the true epidemiology of this virus. The geographical restriction we observed both within the UK, and between the UK and other countries, together with the lack of CPV-2c in this population, strongly suggest the spread of CPV within its population may be heterogeneously subject to limiting factors. 
This cross-sectional study of national and global CPV phylogeographic segregation reveals a substantially more complex epidemic structure than previously described.

Generation of neutralizing antibodies and divergence of SIVmac239 in cynomolgus macaques following short-term early antiretroviral therapy.

PLoS Pathog. 2010 Sep 2;6(9):e1001084.
Ozkaya Sahin G, Bowles EJ, Parker J, Uchtenhagen H, Sheik-Khalil E, Taylor S, Pybus OG, Mäkitalo B, Walther-Jallow L, Spångberg M, Thorstensson R, Achour A, Fenyö EM, Stewart-Jones GB, Spetz AL.

Neutralizing antibodies (NAb) able to react to heterologous viruses are generated during natural HIV-1 infection in some individuals. Further knowledge is required in order to understand the factors contributing to induction of cross-reactive NAb responses. Here a well-established model of experimental pathogenic infection in cynomolgus macaques, which reproduces long-lasting HIV-1 infection, was used to study the NAb response as well as the viral evolution of the highly neutralization-resistant SIVmac239. Twelve animals were infected intravenously with SIVmac239. Antiretroviral therapy (ART) was initiated ten days post-inoculation and administered daily for four months. Viral load, CD4(+) T-cell counts, total IgG levels, and breadth as well as strength of NAb in plasma were compared simultaneously over 14 months. In addition, envs from plasma samples were sequenced at three time points in all animals in order to assess viral evolution. We report here that seven of the 12 animals controlled viremia to below 10(4) copies/ml of plasma after discontinuation of ART and that this control was associated with a low level of evolutionary divergence. Macaques that controlled viral load developed broader NAb responses early on. Furthermore, escape mutations, such as V67M and R751G, were identified in virus sequenced from all animals with uncontrolled viremia. Bayesian estimation of ancestral population genetic diversity (PGD) showed an increase in this value in non-controlling or transient-controlling animals during the first 5.5 months of infection, in contrast to virus-controlling animals. Similarly, non- or transient controllers displayed more positively-selected amino-acid substitutions. An early increase in PGD, resulting in the generation of positively-selected amino-acid substitutions, greater divergence and relative high viral load after ART withdrawal, may have contributed to the generation of potent NAb in several animals after SIVmac239 infection. 
However, early broad NAb responses correlated with relatively preserved CD4(+) T-cell numbers, low viral load and limited viral divergence.

Safety and immunogenicity of novel recombinant BCG and modified vaccinia virus Ankara vaccines in neonate rhesus macaques.

J Virol. 2010 Aug;84(15):7815-21. Epub 2010 May 19.
Rosario M, Fulkerson J, Soneji S, Parker J, Im EJ, Borthwick N, Bridgeman A, Bourne C, Joseph J, Sadoff JC, Hanke T.

Although major inroads into making antiretroviral therapy available in resource-poor countries have been made, there is an urgent need for an effective vaccine administered shortly after birth, which would protect infants from acquiring human immunodeficiency virus type 1 (HIV-1) through breast-feeding. Bacillus Calmette-Guérin (BCG) is given to most infants at birth, and its recombinant form could be used to prime HIV-1-specific responses for a later boost by heterologous vectors delivering the same HIV-1-derived immunogen. Here, two groups of neonate Indian rhesus macaques were immunized with either novel candidate vaccine BCG.HIVA(401) or its parental strain AERAS-401, followed by two doses of recombinant modified vaccinia virus Ankara MVA.HIVA. The HIVA immunogen is derived from African clade A HIV-1. All vaccines were safe, giving local reactions consistent with the expected response at the injection site. No systemic adverse events or gross abnormality was seen at necropsy. Both AERAS-401 and BCG.HIVA(401) induced high frequencies of BCG-specific IFN-gamma-secreting lymphocytes that declined over 23 weeks, but the latter failed to induce detectable HIV-1-specific IFN-gamma responses. MVA.HIVA elicited HIV-1-specific IFN-gamma responses in all eight animals, but, except for one animal, these responses were weak. The HIV-1-specific responses induced in infants were lower compared to historic data generated by the two HIVA vaccines in adult animals but similar to other recombinant poxviruses tested in this model. This is the first time these vaccines were tested in newborn monkeys. These results inform further infant vaccine development and provide comparative data for two human infant vaccine trials of MVA.HIVA.

Full-Length Characterization of Hepatitis C Virus Subtype 3a Reveals Novel Hypervariable Regions under Positive Selection during Acute Infection

Humphreys I, Fleming V, Fabris P, Parker J, Schulenberg B, Brown A, Demetriou C, Gaudieri S, Pfafferott K, Lucas M, Collier J, Huang KH, Pybus OG, Klenerman P, Barnes E.

J Virol. 2009 Nov;83(22):11456-66. Epub 2009 Sep 9.

Hepatitis C virus subtype 3a is a highly prevalent and globally distributed strain that is often associated with infection via injection drug use. This subtype exhibits particular phenotypic characteristics. In spite of this, detailed genetic analysis of this subtype has rarely been performed. We performed full-length viral sequence analysis in 18 patients with chronic HCV subtype 3a infection and assessed genomic viral variability in comparison to other HCV subtypes. Two novel regions of intragenotypic hypervariability within the envelope protein E2, of HCV genotype 3a, were identified. We named these regions HVR495 and HVR575. They consisted of flanking conserved hydrophobic amino acids and central variable residues. A 5-amino-acid insertion found only in genotype 3a and a putative glycosylation site is contained within HVR575. Evolutionary analysis of E2 showed that positively selected sites within genotype 3a infection were largely restricted to HVR1, HVR495, and HVR575. Further analysis of clonal viral populations within single hosts showed that viral variation within HVR495 and HVR575 were subject to intrahost positive selecting forces. Longitudinal analysis of four patients with acute HCV subtype 3a infection sampled at multiple time points showed that positively selected mutations within HVR495 and HVR575 arose early during primary infection. HVR495 and HVR575 were not present in HCV subtypes 1a, 1b, 2a, or 6a. Some variability that was not subject to positive selection was present in subtype 4a HVR575. Further defining the functional significance of these regions may have important implications for genotype 3a E2 virus-receptor interactions and for vaccine studies that aim to induce cross-reactive anti-E2 antibodies.

Estimating the Date of Origin of An HIV-1 Circulating Recombinant Form

Virology. 2009 Apr 25;387(1):229-34. Epub 2009 Mar 9.
Tee KK, Pybus OG, Parker J, Ng KP, Kamarulzaman A, Takebe Y.

HIV is capable of frequent genetic exchange through recombination. Despite the pandemic spread of HIV-1 recombinants, their times of origin are not well understood. We investigate the epidemic history of an HIV-1 circulating recombinant form (CRF) by estimating the time of the recombination event that led to the emergence of CRF33_01B, a recently described recombinant descended from CRF01_AE and subtype B. The gag, pol and env genes were analyzed using a combined coalescent and relaxed molecular clock model, implemented in a Bayesian Markov chain Monte Carlo framework. Using linked genealogical trees we calculated the time interval between the common ancestor of CRF33_01B and the ancestors it shares with closely related parental lineages. The recombination event that generated CRF33_01B (t(rec)) occurred sometime between 1991 and 1993, suggesting that recombination is common in the early evolutionary history of HIV-1. The proof-of-concept approach provides a new tool for the investigation of HIV molecular epidemiology and evolution.
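The bracketing logic behind that time interval can be sketched in a few lines. In each genome region the recombination event must fall after the recombinant lineage split from its parental lineage and before the recombinant's own common ancestor, so the overall window is the overlap of those per-region intervals. This is a simplified reading that uses point estimates only; the paper works with full Bayesian posterior distributions, and the function name, region labels and dates below are invented placeholders, not results.

```python
def recombination_window(regions):
    """Return (earliest, latest) calendar years bracketing the recombination.

    regions: dict mapping region name -> (parent_split_year, crf_tmrca_year),
             where parent_split_year dates the ancestor shared with the
             parental lineage and crf_tmrca_year dates the recombinant's
             own most recent common ancestor.
    """
    # The event cannot predate the most recent split from a parent...
    lower = max(split for split, _ in regions.values())
    # ...and cannot postdate the earliest recombinant common ancestor.
    upper = min(tmrca for _, tmrca in regions.values())
    if lower > upper:
        raise ValueError("inconsistent node dates: no valid window")
    return lower, upper


# Placeholder dates for two hypothetical genome regions.
example = {
    "region_1": (1970.0, 1975.0),
    "region_2": (1972.0, 1976.0),
}
print(recombination_window(example))
```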

Correlating Viral Phenotypes With Phylogeny: Accounting for Phylogenetic Uncertainty

Infect Genet Evol. 2008 May;8(3):239-46. Epub 2007 Aug 21.
Parker J, Rambaut A, Pybus OG.

Many recent studies have sought to quantify the degree to which viral phenotypic characters (such as epidemiological risk group, geographic location, cell tropism, drug resistance state, etc.) are correlated with shared ancestry, as represented by a viral phylogenetic tree. Here, we present a new Bayesian Markov-Chain Monte Carlo approach to the investigation of such phylogeny-trait correlations. This method accounts for uncertainty arising from phylogenetic error and provides a statistical significance test of the null hypothesis that traits are associated randomly with phylogeny tips. We perform extensive simulations to explore and compare the behaviour of three statistics of phylogeny-trait correlation. Finally, we re-analyse two existing published data sets as case studies. Our framework aims to provide an improvement over existing methods for this problem.
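The significance test described above can be illustrated with a small tip-randomisation sketch on a single tree. It uses a Fitch parsimony score as the phylogeny-trait statistic, which is an assumption made for the example since the abstract does not name the three statistics; the nested-tuple tree format, function names and trait labels are likewise hypothetical. Accounting for phylogenetic uncertainty would mean repeating this over a posterior sample of trees rather than one fixed topology.

```python
import random


def fitch(node, traits):
    """Fitch pass on a binary tree of nested 2-tuples with string tips.

    Returns (possible ancestral states, minimum number of state changes).
    """
    if isinstance(node, str):
        return {traits[node]}, 0
    left_set, left_changes = fitch(node[0], traits)
    right_set, right_changes = fitch(node[1], traits)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, traits):
    """Minimum number of trait changes needed on the tree."""
    return fitch(tree, traits)[1]


def tip_permutation_p(tree, traits, n_perm=1000, seed=0):
    """One-sided p-value: how often shuffled tip traits cluster at least
    as strongly (score as low or lower) as the observed traits."""
    rng = random.Random(seed)
    tips = sorted(traits)
    states = [traits[t] for t in tips]
    observed = parsimony_score(tree, traits)
    hits = 0
    for _ in range(n_perm):
        shuffled = states[:]
        rng.shuffle(shuffled)
        if parsimony_score(tree, dict(zip(tips, shuffled))) <= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical eight-tip tree with a perfectly clustered two-state trait.
tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
traits = {t: "X" for t in "abcd"}
traits.update({t: "Y" for t in "efgh"})
print(tip_permutation_p(tree, traits))
```

A low score means the trait changes rarely along the tree, so a small p-value here indicates stronger clustering of traits with ancestry than random tip labels would give.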