Tag Archives: phylogenomics

Real-time phylogenomics or ‘Some interesting problems in genomic big data’

Talk given at a technology/informatics company, London, Feb 2018.

[slideshare id=87391225&doc=joe-parker-reak-time-phylogenomics-180207132740]

An overview of contemporary advances and remaining problems in big-data biology, especially phylogenomics.

Tent-seq: the paper (aka ‘field-based, real-time phylogenomics’)

Really proud to report that the first of our bona fide real-time phylogenomics papers is now out in Scientific Reports!

In the paper we managed to show a number of things that are potentially really exciting, and I’ll get to them in a minute. First though, this is the first paper I’ve published where I got to drive every part: from conceiving the idea (with Alex) to getting funding, planning and carrying out the fieldwork/field-sequencing (with Alex and Dion), sequencing (all by Andrew) and analysis and writing (everyone). This was incredibly satisfying as normally a lot of my time is spent analysing downstream data. I feel like a proper grown-up scientist now. More please!

Firstly, what did we actually do? Pretty simple really:

  • Over a week (May 2016) in Snowdonia National Park, Wales,
  • We collected the flowering plant Arabidopsis thaliana and its congener A. lyrata,
  • Extracted their DNA and prepared sequencing libraries for MinION sequencing, in a tent with no mains power or running water,
  • Sequenced both species using Oxford Nanopore MinION, and
  • Analysed them in real-time with BLAST databases held locally, building trees with a handful of genes.

Later on back in the lab we repeated the sequencing (but not extractions) with Illumina MiSeq, so we could compare the platforms, and also developed a few more sophisticated bioinformatics analyses. To be honest, most of the pipelines we ran could have run in real-time (and now do) but at the time of the main fieldwork we just didn’t expect it would work as well as it did!

Result 1: Genomic DNA sequencing with MinION is fairly easy, even in the middle of nowhere.

Seriously, depending on how much patience and practical skill you have, this is either easy or really easy. We used the Oxford Nanopore 1D Rapid sequencing kit for the sequencing (disclaimer: actually a prototype ONT provided us with, though the off-the-shelf version is much better now), and extracted DNA using the Qiagen DNeasy Plant Mini kit, modified with a longer initial incubation and a double-concentrated cleaning step, but essentially unchanged. The MinION itself, as is well documented, runs off USB into a laptop.

Hardware-wise you’ll need:

  • Two waterbaths (or in our case, two polyboxes, a gas kettle, and some thermometers)
  • A centrifuge
  • A generator to run said devices
  • Some poly boxes with −20°C ice blocks and wet ice for reagents

… that’s it. If you’re looking at this list and thinking ‘I could get all that together by next weekend, maybe we should go on a sequencing trip’ well, that’s the idea 🙂

There are a lot of possible refinements. A portable freezer will make life easier, as will a dedicated 12V supply for USB power and a portable DNA quantification tool like a Quantus. Plus, none of the above really likes rain, so a tent and/or van (as with Nick Loman and Josh Quick’s Zika trip last year) will help out a lot. But to get started, that’s it.

Result 2: Long MinION reads are really good at species ID – even better than Illumina in key respects.

The core goal of this project was to work out “can field-based, whole-genome (WGS) sequencing identify closely related species?”. So we deliberately picked two species from the same genus with publicly available reference genome sequences (A. thaliana and A. lyrata). The ID process would be simple. For each of the four datasets (two MinION runs, one from each species, and two MiSeq 2x300bp paired-end runs, one for each species), we’d simply (a minimal sketch of the comparison step follows the list below):

  1. Trim adapters from each read
  2. Match each read to the A. thaliana genome using the best hit with BLASTN
  3. Match each read to the A. lyrata genome, using the same method
  4. Compare the hit scores for each reference genome
  5. Un-blind the read (reveal which of the two species it actually came from)
  6. Score the read as a true or false positive or negative, depending on the result.
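
Here’s a minimal sketch of that comparison step in R. The file names, database names and the 50bp threshold are illustrative assumptions rather than our exact pipeline; it assumes blastn has been run once per read set against each reference with tabular (-outfmt 6) output, keeping the best hit per read:

# e.g. blastn -query reads.fasta -db Athaliana -outfmt 6 -max_target_seqs 1 > hits_thaliana.tsv
#      blastn -query reads.fasta -db Alyrata   -outfmt 6 -max_target_seqs 1 > hits_lyrata.tsv

cols <- c("read", "subject", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")

best_hits <- function(file) {
  hits <- read.delim(file, header = FALSE, col.names = cols)
  hits <- hits[order(hits$read, -hits$bitscore), ]      # best (highest bitscore) hit first
  hits[!duplicated(hits$read), c("read", "length")]     # one row per read: alignment length
}

tha <- best_hits("hits_thaliana.tsv")
lyr <- best_hits("hits_lyrata.tsv")

both <- merge(tha, lyr, by = "read", all = TRUE, suffixes = c(".tha", ".lyr"))
both[is.na(both)] <- 0                                  # reads with no hit to one reference

threshold <- 50   # bp of extra alignment length needed to call a winner (see below)
both$call <- ifelse(abs(both$length.tha - both$length.lyr) < threshold, "ambiguous",
             ifelse(both$length.tha > both$length.lyr, "A. thaliana", "A. lyrata"))
table(both$call)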

Clearly, many reads that find a significant BLAST alignment to one species will also find a significant alignment to the other, since these species are separated by only a couple of million years (and are pretty similar phenotypically). So if both hits are ‘significant’, how do we decide which is the better one? Intuitively it seems sensible that the longest match / most identities will win, but what threshold length difference should we use? 1bp longer? 10bp? 100bp?

Happily, the ROCR package in R lets you investigate the performance of test statistics on known classifier sets. We used this to produce the plot below, which shows the effect of increasing the threshold length difference on true-positive (TP), false-positive (FP) and accuracy rates for MinION (black) and MiSeq (red) reads:

[Figure 2c-2d: TP/FP and accuracy rates as a function of length-difference threshold, MinION vs MiSeq]

The really key thing here is that MinION reads beat MiSeq reads at most length-difference (bias) thresholds greater than ~5-10bp, right up to 300bp (the MiSeq inserts top out at this length, of course). This is important because, while here we’re testing reciprocal ID cases (A. lyrata against A. thaliana, and vice versa), in a practical application we might have a third species without a reference but two possible matches, and while some loci will be closer to the first, others could match the second. So while a threshold of 1bp might technically be the best (a TP rate of ~90% and close to 100% accuracy), we may want to raise the threshold to a much higher value (>50bp) and accept a lower TP rate in exchange for greater confidence.
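
The ROCR calls themselves are trivial. A rough sketch, assuming the per-read table both from above plus a truth column added after un-blinding (TRUE where the read really came from A. thaliana); this is a one-sided simplification of the analysis in the paper:

library(ROCR)

score <- both$length.tha - both$length.lyr        # classifier score: length difference in bp
pred  <- prediction(score, both$truth)            # labels: TRUE = really A. thaliana

tpr <- performance(pred, measure = "tpr")         # x.values are the score cutoffs
acc <- performance(pred, measure = "acc")

plot(tpr@x.values[[1]], tpr@y.values[[1]], type = "l", xlim = c(0, 300),
     xlab = "length-difference threshold (bp)", ylab = "rate")
lines(acc@x.values[[1]], acc@y.values[[1]], lty = 2)   # dashed: accuracy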

Result 3: Species ID does not need complete genomes for reference databases, and even works with a handful of MinION reads.

One very obvious and sensible criticism of our early drafts was that the reference genomes we used to build the BLASTN databases are essentially completely assembled. While there’s been some handwringing recently about the structural variation of these plant genomes at the population level, most people accept that a high proportion of the informative sequence in these genomes is now well determined.

For most people, in most places, this will not be the case: there are ~300,000-400,000 described plant species, but only ~180-250 public genomes. Most of those are of the fairly-low-coverage HTS WGS variety as well, so are pretty bittily assembled. Quite often they come from first-year baselining experiments in the ‘get some DNA and run it through the MiSeq, then SOAP or ABySS’ mould, with N50 values in the low thousands.

So to test the effect of this, we artificially digested the reference genomes a few thousand times in silico, simulating N50 values from about 100bp (virtually unassembled) up to 10^6 bp (essentially complete); the plot below (Fig. 3a) shows N50 values from 10^0 up to 10^4, with accuracy scores calculated for a range of cutoff values:

[Figure 3: species-ID accuracy as a function of simulated reference N50 (a) and of the number of reads subsampled (b, c)]
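
The digestion itself needs nothing fancy. A rough sketch (our real simulation differed in its details, and file names here are assumptions): break each reference sequence at random points so the expected fragment length matches a target, check the realised N50, and only then build a BLAST database from the fragments.

library(Biostrings)   # Bioconductor: readDNAStringSet(), DNAStringSet(), writeXStringSet()

n50 <- function(lens) {
  lens <- sort(lens, decreasing = TRUE)
  lens[which(cumsum(lens) >= sum(lens) / 2)[1]]
}

digest_one <- function(seq, target) {
  L <- nchar(seq)
  n_breaks <- max(0, round(L / target) - 1)
  cuts <- sort(sample.int(L - 1, n_breaks))
  substring(seq, c(1, cuts + 1), c(cuts, L))       # fragments between random breakpoints
}

genome <- as.character(readDNAStringSet("Athaliana.fasta"))
frags  <- unlist(lapply(genome, digest_one, target = 1e3))
n50(nchar(frags))                                   # realised N50 of the simulated 'assembly'
# writeXStringSet(DNAStringSet(frags), "Athaliana_digested.fasta")   # then makeblastdb as usual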

These results were pretty promising, so finally we asked ourselves: OK, we had tens of thousands of MinION reads to make our ID with, generated over a day or so, but how few reads would we need to have a stab at a correct ID? Again, we jackknifed our dataset to find out, shown above in Figs 3b and 3c. Promisingly, you can see that by about 10^2-10^3 reads (in practice, an hour or less of sequencing) the confidence intervals on our ID score barely budge. So, after an hour of sequencing, you’re likely to get as good an answer as you’re going to get. One. Hour…!!!
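
For the record, the subsampling is just a loop over random draws of reads; something like this sketch (again using the per-read calls in both from earlier, not our exact code):

subsample_score <- function(calls, true_species, n, reps = 100) {
  replicate(reps, {
    idx <- sample(length(calls), min(n, length(calls)))   # draw n reads without replacement
    mean(calls[idx] == true_species)                      # fraction assigned correctly
  })
}

sizes <- 10^(0:4)
draws <- lapply(sizes, function(n) subsample_score(both$call, "A. thaliana", n))
sapply(draws, function(x) c(mean = mean(x), quantile(x, c(0.025, 0.975))))   # score and 95% interval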

Result 4: Field-sequenced WGS MinION long-reads substantially improve downstream genomics with low-coverage HTS data.

When planning the fieldwork we hadn’t really known what we’d get in terms of read length, quality, or yield: this was a prototype kit that not many people had played with yet, let alone taken into the field. But around the time we were writing this up we found out about various genome assemblers optimised (supposedly) for ONT reads, chiefly Canu and hybridSPAdes. We decided to give them a whirl… the results are pretty amazing!

Data                                         | MiSeq only | MiSeq + MinION
Assembler                                    | ABySS      | hybridSPAdes
Illumina reads, 300bp paired-end             | 8,033,488  | 8,033,488
Illumina data (yield)                        | 2,418 Mbp  | 2,418 Mbp
MinION reads, R7.3 + R9 kits, N50 ~4,410bp   | –          | 96,845
MinION data (yield)                          | –          | 240 Mbp
Approx. coverage                             | 19.49x     | 19.49x + 2.01x
Assembly key statistics:                     |            |
  # contigs                                  | 24,999     | 10,644
  Longest contig                             | 90 Kbp     | 414 Kbp
  N50 contiguity                             | 7,853 bp   | 48,730 bp
  Fraction of reference genome (%)           | 82         | 88
Errors, per 100 kbp:                         |            |
  # N’s                                      | 1.7        | 5.4
  # mismatches                               | 518        | 588
  # indels                                   | 120        | 130
Largest alignment                            | 76,935 bp  | 264,039 bp
CEGMA gene completeness estimate:            |            |
  # genes                                    | 219 of 248 | 245 of 248
  % genes                                    | 88%        | 99%

Result 5: Individual MinION reads can be directly, individually annotated for coding loci with no assembly required.

By now everyone was getting a bit sick of me going on about MinION reads, but there was one final hunch I wanted to test: if reads are about the same length as nuclear coding loci (~5,000-50,000bp), does that mean we can annotate individual reads to pull out coding sequences, and use them to build phylogenies? SNAP was a great tool for this, not least because it comes already trained on A. thaliana gene models.

I want to be absolutely clear here, as sometimes people seem to miss this: I’m not talking about assembling reads before annotation as usual. I’m not even talking about assembling them in real-time and then annotating. I mean: each time a read finishes basecalling, immediately try to annotate that single read, and only that single read, to pull out a coding sequence.
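
The logic is no more complicated than a polling loop over the basecaller’s output. A minimal sketch (directory layout, file naming and the exact SNAP invocation are assumptions here, and the real pipeline did more bookkeeping):

watch_dir <- "reads/pass"          # wherever basecalled reads land, one FASTA per read
seen <- character(0)

while (TRUE) {
  fastas <- list.files(watch_dir, pattern = "\\.fasta$", full.names = TRUE)
  for (f in setdiff(fastas, seen)) {
    out <- sub("\\.fasta$", ".zff", f)
    # one SNAP call per read, using the A. thaliana gene models shipped with SNAP;
    # no assembly step anywhere
    system(paste("snap A.thaliana.hmm", shQuote(f), ">", shQuote(out)))
    seen <- c(seen, f)
  }
  Sys.sleep(10)                    # poll every 10 seconds
}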

In other words, how quickly can we turn a tube of DNA, into a folder of sequence reads, into alignments of coding loci? The answer is, ‘bloody quickly’:

[Figure 2e: cumulative number of predicted coding sequences against sequencing time]

The dashed line shows ‘all gene models’; the solid line shows unique CDS. The y-axis is the number of predicted CDS and yes, those are thousands – recall that the total number of CDS for A. thaliana is only about 23,000. The x-axis is, well, actual sequencing time (!)*

Now, not all of these are complete genes, and the error rate means distinguishing paralogs robustly in a real case (e.g. a completely novel genome) would be tricky; but on the other hand, this was a completely unoptimised pipeline, really just hacked together over a couple of weeks. There’s a lot of scope to improve this…

*These are the read timestamps, but it wasn’t a live run. I actually ran the analysis back in the lab afterwards, as my code was too buggy on the fieldwork day and I just lost my shit. But the CPU demands aren’t high – I can and have run this live subsequently.

 

Result 6: Predicted coding loci from individual MinION reads can be aligned to orthologous sequences, and multilocus phylogeny inferred in real-time.

I build trees. I build trees. I build trees for a living. Did you seriously think, that having got as far as spitting out thousands of novel coding loci per day in field-based sequencing, I wasn’t going to try some real-time, field-based, multilocus phylogenomic inference?

[Figure: *BEAST species tree from 53 loci, with the hybridSPAdes assembly for comparison]

Here’s a *BEAST tree from 53 loci, all of the A. thaliana ones coming from directly-annotated, field-sequenced reads. Seriously 😉

Summary

As you’ve gathered by now, I’m enormously happy with this research. I think this paper is easily a bigger contribution to science in general than our 2013 molecular convergence one because, if we’re honest, molecular convergence is an interesting but niche phenomenon, whereas literally everyone who uses or categorises any kind of biological material can benefit from this paper.

I’m indebted to my colleagues, Alex S. T. Papadopulos, Dion Devey, Andrew Helmstetter and Tim Wilkinson, and our funders, the Kew Foundation. We didn’t invent the MinION – that took hundreds of incredibly clever people at Oxford Nanopore years and millions of pounds of investment – but in this study we’ve managed to show all of the really transformative aspects of this technology working in the field, in real time. There is no technical reason at all why we shouldn’t expect that, within a decade, all of the analyses we currently run on DNA data in labs can run in the field, within minutes of collecting biological samples. And that really is something.

What is ‘real-time’ phylogenomics?

Over the past few years I’ve been developing a strand of research which I collectively refer to as ‘real-time phylogenomics’ – and this is the name of our mini-site for MinION-based rapid identification-by-sequencing. Since our paper on this will hopefully be published soon, it’s probably worth defining now what I hope this term denotes, what it does not – and ultimately where I hope this research is going.

‘Phylogenomics’ is simple enough, and Jonathan Eisen at UC Davis has been a fantastic advocate of the concept. Essentially, phylogenomics is scaled-up molecular systematics, with datasets (usually derived from a genome annotation and/or transcriptome) comprising many coding loci rather than a few genes. ‘Many’ in this case usually means hundreds or thousands, so we’re typically looking primarily at nuclear genes, although organelle genomes may often be incorporated, since they’re usually far easier to reliably assemble and annotate. The aim, basically, is to average phylogenetic signal over many loci by combining gene trees (or an analogous approach) to try to obtain phylogenies with higher confidence (single- or few-locus approaches, including barcodes no matter how judiciously chosen, are capable of producing incorrect trees with high confidence). The process is intensive, since genomes must be sequenced and then assembled to a sufficient standard to be reasonably certain of identifying orthologous loci. This isn’t the only use of the term (it also refers to phylogenies produced from whole-genome metagenomics), but it is the most straightforward and common one as far as eukaryote genomics is concerned, and certainly the one uppermost in my mind.

However, the results are often confusing, or at least more complex than we might hope: instead of a single phylogeny with high support from all loci, robust to the model used, we often find that a high proportion of gene trees (10-30%, perhaps) agree with each other but not with the modal (most common, e.g. majority-rule consensus) tree topology. For instance, among 2,326 loci in our 2013 paper on the phylogenomics of the major bat families, we found that the position of a particular group of echolocators – which had been hotly debated for decades, based on morphological and single-locus approaches – showed such a pattern (sometimes supporting the traditional grouping of Microchiroptera + Megachiroptera, but with over 60% of loci supporting the newer Yangochiroptera + Yinpterochiroptera system). This can happen for a variety of reasons, some biological and some methodological. The point is that we have a sufficiently detailed picture to let us choose between competing phylogenetic hypotheses with both statistical confidence and intuition based on comparison.

These techniques have been on the horizon for a while (certainly since at least 2000) and have gathered pace over the last decade with improvements in computing, informatics, and especially next-generation sequencing. The other half of this equation, ‘real-time’ sequencing, has emerged much more recently and centres, obviously, on the MinION sequencer. Most work using it so far has focused either on the very impressive potential long-read data offer for genomic analyses, particularly assembly, or on rapid ID of samples, e.g. the Quick/Loman Zika and Ebola monitoring studies, and our own work.

So what, exactly, do we hope to achieve with phylogenomic-type analyses using real-time MinION data, and why?

Well, firstly, our work so far has shown that the existing pipeline (sample -> transport -> sequence -> assemble genome -> annotate genes -> align loci -> build trees) has lots of room for speedups, and we’re fairly confident that the inevitable tradeoff with accuracy when you omit or simplify certain steps (laboratory-standard sequencing, assembly) is at least compensated for by the volume of data alone. Recall that a ‘normal’ phylogenomic tree similar to our bat one might take two or more postdocs/students a year or more to generate from biological samples. A process taking a week instead would let you generate something like 50 such analyses in a year! The most obvious application is simply accelerating existing research, but the potential for transforming fieldwork and citizen science is considerable. This is because you can build trees that inform species relationships even if the species in question isn’t known; in other words, a phylogenome can both reliably identify an unknown sample and indicate whether it is a new species.

More excitingly, I think we need to take a deeper look at how we both construct and analyse evolutionary models. Life on Earth is most accurately and fully described by a network, not a bifurcating tree, and this applies to loci as well as to single genes. In other words, there is a single network that connects every locus in every living thing. Phylogenetic trees are only a bifurcating projection of this network, while single- or multi-locus networks comprise only a part of it.

We’ve hitherto ignored this fact, largely because (a) trees are often a good approximation, especially in the case of eukaryote nuclear genes, and (b) the data and computation requirements a ‘network-of-life’ analysis implies are formidable. However, cracks are beginning to appear in both of these assumptions. Firstly, many loci are subject to real biological phenomena (horizontal gene transfer, selection leading to adaptive convergence, etc.) which give erroneous trees, as discussed above – and prokaryotic and viral inference is rarely even this straightforward. Secondly, expanding computing power, algorithmic sophistication, and sequencing capacity (imagine just 1,000 high schools across the world regularly using a MinION for class projects…) mean the question for us today really isn’t ‘how do we get data?’, but ‘how ambitious do we want to be with it?’

Since my PhD but especially since 2012, I’ve been developing this theme. Ultimately I think the answer lies in the continuous analysis of public phylogenomic data. Imagine multiple distributed but linked analyses, continuously running to infer parts of the network of life, updating their model asynchronously both as new data flood in, and by exchanging information with each other. This is really what we mean by real-time phylogenomics – nothing less than a complete Network of Life, living in the cloud, publicly available and collaboratively and continuously inferred from real-time sequence data.

So… that’s what I plan to spend the 2020s doing, anyway.

 

Single-molecule real-time (SMRT) Nanopore sequencing for Plant Pathology applications

A short presentation to the British Society for Plant Pathology’s ‘Grand Challenges in Plant Pathology’ workshop on the uses of real-time DNA/RNA sequencing technology for plant health applications.

Doctoral Training Centre, University of Oxford, 14th September 2016.

Slides [SlideShare]: cc-by-nc-nd

[slideshare id=66051562&doc=smrt-nanopore-gcpp-joeparker-160915100855]

Phylogenomic convergence detection: lessons and perspectives

Talk presented at the 18th Evolutionary Biology Meeting At Marseille (programme), 16th-19th September 2014.

(Powerpoint – note this is a draft, not the final talk, pending authorisation): EBMdraft

[slideshare id=41517262&doc=ebmjoeparkerconvergencefinal-recover-nosampling-141113102943-conversion-gate01]

Using x-windows to render GUIs for server-client phylogenomics

The power of modern computers allied to high-throughput next-generation sequencing will usher in a new era of quantitative biology that will deliver the moon on a stick to every researcher on Earth…

Well, quite.

But something we’ve run up against recently in the lab is actually performing useful analyses on our phylogenomic datasets after the main evolutionary analyses. That is, once you’ve assembled, aligned, built phylogenies and measured interesting things like divergence rates and dN/dS – how do you display and explore the collected set of loci?

I’ll give you an example – suppose we have a question like “what is the correlation between tree length and mean dS rate in our 1,000 loci?” It’s fairly simple to spit the relevant statistics out into text files and then cat them together for analysis in R (or Matlab, but please, for the love of God, not Excel…). And so far, this approach has got a lot of work done for us.
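
For concreteness, a sketch of that collate-then-correlate step, assuming each locus has written a one-line stats file and they’ve been catted (with a header) into loci.tsv:

loci <- read.delim("loci.tsv")     # assumed columns: locus, tree_length, mean_dS
cor.test(loci$tree_length, loci$mean_dS, method = "spearman")
plot(loci$tree_length, loci$mean_dS, xlab = "tree length", ylab = "mean dS")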


But suppose we wanted to change the question slightly – maybe we notice some pattern when eyeballing a few of the most interesting alignments – and instead ask something like “what is the correlation between tree length and mean dS rate in our 1,000 loci, where polar residues are involved with a sitewise log-likelihood of ≤2 units?” Or ≤3? Or ≤10? Most of this information is already available at runtime, parsed or calculated. We could tinker with our collation script again, re-run it, and do another round of R analyses; but this is wasteful on a variety of levels. It would be much more useful to interact with the data on the server while it’s in memory. We’ve been starting to look at ways of doing this, and that’s for another post. But for now there’s another question to solve before that becomes useful, namely: how do you run a GUI application on a server but display it on a client?

There are a variety of bioinformatics applications (Geneious, Galaxy, CLC Bio) with client-server functionality built in, but in each case we’d have to wrap our analysis in a plugin layer or module in order to use it. Since we’ve already got the analysis in this case, that seems unnecessary, so instead we’re trialling X-windows. This has nothing to do with Microsoft Windows: it’s just a display protocol that piggybacks on an SSH connection so a UNIX application running on the server can draw to the client’s display (e.g. XQuartz on a Mac). It sounds simple, and it is!

To set up an x-windowing session, we just have to first set up the connection to the bioinformatics server:

ssh -Y user@host.ac.uk

Now we have a secure x-window connection, our server-based GUI will display on the client if we simply call it (MDK.jar):

User:~ /usr/bin/java -jar ~/Documents/dev-builds/mdk/MDK.jar

Rather pleasingly simple, yes? Now, there are a few issues with this, not least a noticeable cursor lag. But – remembering that the aim here is simply to interact with analysis data held in memory at runtime, rather than do any complicated editing – we can probably live with that, for now at least. On to the next task!

(also published at evolve.sbcs)

Our Nature paper! Genome-wide molecular convergence in echolocating mammals

Exciting news from the lab this week… we’ve published in one of the leading journals, Nature!!!

Much of my work in the Rossiter BatLab for the last couple of years has centred around the search for genomic signatures of molecular convergence. This means looking for similar genetic changes in otherwise unrelated organisms. We’d normally expect unrelated organisms to differ considerably in their genetic sequences, because over time random mutations occur in their genomes; the more time has passed since two species diverged, the more changes we expect. However, we know that similar structures may evolve in unrelated species due to shared selection pressures (think of the streamlined body shapes of sharks, ichthyosaurs and dolphins, for example). Can these pressures produce identical changes right down at the level of genetic sequences? We hoped to detect identical genetic changes in unrelated species (in this case, echolocation – ‘sonar hearing’ – in some species of bats and whales) caused by similar selection pressures operating on the evolution of the genes required for those traits.

It’s been a long slog – we’ve had to write a complicated computer program to look at millions of letters of DNA – but this week it all bears fruit. We found that a staggering number of genes in the genomes of echolocating bats and whales (a bottlenose dolphin, if you must) showed evidence of these similar genetic changes, known technically as ‘genetic convergence’.

Obviously we started jumping up and down when we found this, and because we imagined other scientists would too, we wrote up our findings and sent them to the journal Nature, one of the top journals in the world of science… and crossed our fingers.

Well, today we can finally reveal that we were able to get through the peer-review process (where anonymous experts scrutinise your working – a bit like an MOT for your experiments), and the paper is out today!

But what do we actually say? Well:
“Evolution is typically thought to proceed through divergence of genes, proteins and ultimately phenotypes. However, similar traits might also evolve convergently in unrelated taxa owing to similar selection pressures. Adaptive phenotypic convergence is widespread in nature, and recent results from several genes have suggested that this phenomenon is powerful enough to also drive recurrent evolution at the sequence level. Where homoplasious substitutions do occur these have long been considered the result of neutral processes. However, recent studies have demonstrated that adaptive convergent sequence evolution can be detected in vertebrates using statistical methods that model parallel evolution, although the extent to which sequence convergence between genera occurs across genomes is unknown. Here we analyse genomic sequence data in mammals that have independently evolved echolocation and show that convergence is not a rare process restricted to several loci but is instead widespread, continuously distributed and commonly driven by natural selection acting on a small number of sites per locus. Systematic analyses of convergent sequence evolution in 805,053 amino acids within 2,326 orthologous coding gene sequences compared across 22 mammals (including four newly sequenced bat genomes) revealed signatures consistent with convergence in nearly 200 loci. Strong and significant support for convergence among bats and the bottlenose dolphin was seen in numerous genes linked to hearing or deafness, consistent with an involvement in echolocation. Unexpectedly, we also found convergence in many genes linked to vision: the convergent signal of many sensory genes was robustly correlated with the strength of natural selection. This first attempt to detect genome-wide convergent sequence evolution across divergent taxa reveals the phenomenon to be much more pervasive than previously recognized.”
Congrats to Steve, Georgia and Joe! After a few well-deserved beers we’ll have our work cut out to pick through all these genes and work out exactly what they all do (guessing genes’ biological functions, especially in non-model (read: not us or things we eat) organisms like bats and dolphins, is notoriously tricky). So we’ll probably stick our heads out of the lab again in another two years…

The full citation is: Parker, J., Tsagkogeorga, G., Cotton, J.A.C., Liu, R., Stupka, E., Provero, P. & Rossiter, S.J. (2013) Genome-wide signatures of convergent evolution in echolocating mammals. Nature (epub ahead of print), 4th September 2013. doi:10.1038/nature12511. This work was funded by Biotechnology and Biological Sciences Research Council (UK) grant BB/H017178/1.
