data | Lonely Joe Parker

Over the past few years I’ve been developing research, which I collectively refer to as ‘real-time phylogenomics’ – and this is the name of our mini-site for MinION-based rapid identification-by-sequencing. Since our paper on this will hopefully be published soon, it’s probably worth defining what I hope this term denotes now, what it does not – and ultimately where I hope this research is going.

‘Phylogenomics’ is simple enough, and Jonathan Eisen at UC Davis has been a fantastic advocate of the concept. Essentially, phylogenomics is scaled-up molecular systematics, with datasets (usually derived from a genome annotation and/or transcriptome) comprising many coding loci rather than a few genes. ‘Many’ in this case usually means hundreds, or thousands, so we’re typically looking at primarily nuclear genes, although organelles’ genomes may often be incorporated, since they’re usually far easier to reliably assemble and annotate. The aim is, basically to average phylogenetic signal over many loci by combining gene trees (or an analogous approach) to try and obtain phylogenies with higher confidence (single- or few-locus approaches, including barcodes no matter how judiciously chosen, capable of producing incorrect trees with high confidence). The process is intensive, since genomes must be sequenced and then assembled to a sufficient standard to be reasonably certain of identifying orthologous loci. This isn’t the only use of the term (which also refers to phylogenies produced from whole-genome metagenomics) but the most straightforward and common one as far as eukaryote genomics is concerned, and certainly the one uppermost in my mind.

However the results are often confusing, or at least more complex than we might hope: instead of a single phylogeny with high support from all loci, and robust to the model used, we often find a high proportion of gene trees (10-30%, perhaps) agree with each other, but not the modal (most common, e.g. majority rule consensus) tree topology. For instance among 2, 326 loci in our 2013 paper on phylogenomics of the major bat families, we found that position of a particular group of echolocators – which had been hotly debated for decades, based on morphological and single-locus approaches – showed such a pattern (sometimes supporting the traditional grouping of Microchiroptera + Megachiroptera, but over 60% of loci supporting the newer Yangochiroptera + Yinpterochiroptera system. This can be for a variety of reasons, some biological and some methodological. The point is that we have a sufficiently detailed picture to let us chose between competing phylogenetic hypothesis with both statistical confidence and intuition based on comparison.

These techniques have been on the horizon for a while (certainly since at least 2000) and gathered pace over the last decade with improvements in computing, informatics, and especially next-generation sequencing. The other half of this equation, ‘real-time’ sequencing, has emerged much more recently and centres, obviously, on the MinION sequencer. Most work using this so far has focused either on the very impressive potential long-read data offers for genomic analyses, particularly assembly, or rapid ID of samples e.g. the Quick/Loman Zika and Ebola monitoring studies; and our own work.

So what, exactly, do we hope to achieve with phylogenomic-type analyses using real-time MinION data, and why?

Well, firstly, our work so far has shown that the existing pipeline (sample -> transport -> sequence-> assemble genome-> annotate genes-> align loci-> build trees) has lots of room for speedups, and we’re fairly confident that the inevitable tradeoff with accuracy when you omit or simplify certain steps (laboratory-standard sequencing, assembly) is at least compensated for by the volume of data alone. Recall that a ‘normal’ phylogenomic tree similar to our bat one might take two or more postdocs/students a year to generate from biological samples, often longer. A process taking a week instead would let you generate something like 50 more analyses in a year! The most obvious application for this is just accelerating existing research, but the potential for transforming fieldwork and citizen science is considerable. This is because you can build trees that inform species relationships, even if the species in question isn’t known. In other words a phylogenome can both reliably identify an unknown sample, and also identify if it is a new species.

More excitingly, I think we need to have a deeper look at how we both construct and analyse evolutionary models. Life on earth can be accurately and fully described best by a network, not a bifurcating tree, but this applies to loci as well as single genes. In other words, there is a single network that connects every locus in every living thing. Phylogenetic trees are only a bifurcating projection of this, while single- or multi-locus networks only comprise a part.

We’ve hitherto ignored this fact, largely because (a) trees are often a good approximation, especially in the case of eukaryote nuclear genes, and (b) the data and computation requirements a ‘network-of-life’ analysis implies are formidable. However, cracks are beginning to appear, in both faces. Firstly, many loci are subject to real biological phenomena (horizontal gene transfer, selection leading to adaptive convergence, etc) which give erroneous trees as discussed above. Meanwhile prokaryotic and viral inference is rarely even this straightforward. Secondly, expanding computing power, algorithmic complexity, and sequencing capacity (imagine just 1,000 high schools across the world, regularly using a MinION for class projects…) mean the question for us today really isn’t ‘how do we get data’, but ‘how ambitious do we want to be with it?’

Since my PhD but especially since 2012, I’ve been developing this theme. Ultimately I think the answer lies in the continuous analysis of public phylogenomic data. Imagine multiple distributed but linked analyses, continuously running to infer parts of the network of life, updating their model asynchronously both as new data flood in, and by exchanging information with each other. This is really what we mean by real-time phylogenomics – nothing less than a complete Network of Life, living in the cloud, publicly available and collaboratively and continuously inferred from real-time sequence data.

So… that’s what I plan to spend the 2020s doing, anyway.

Tweet this Digg Post to LinkedIn Slashdot Stumble This

We were having a chat over coffee today and a question arose about merging data from multiple databases. At first sight this seems pretty easy, especially if you’re working with relational databases that have unique IDs (like, uh, a Latin binomial name – Homo sapiens) to hang from… right?

But, oh no.. not at all. One important reason is that seemingly similar data fields can be extremely tricky to merge. They may have been stated with differing precision (0.01, 0.0101, or 0.01010199999?), be encoded in different data types (text, float, blob, hex etc) or character set encodings (UTF-8 or Korean?) and even after all that, refer to subtly different quantities (mass vs weight perhaps). Who knew database ninjas actually earnt all that pay.

So it was surprising, but understandable, to learn that a major private big-data user (unnamed here) stores pretty much everything as text strings. Of course this solves one set of problems nicely (everyone knows how to parse/handle text, surely?) but creates another. That’s because it is trivially easy to code the same real-valued number in multiple different text strings – some of which may break sort algorithms, or even memory constraints. Consider the number ‘0.01’: as written there’s very little ambiguity for you and me. But what about:

“0.01”,
“00.01”,
” 0.01″ (note the space),
or even “0.01000000000”?

After a quick straw poll, we also realised that, although we knew how most of our most-used programming languages (Java for me, Perl, Python etc for others) performed the appropriate conversion in their native string-to-float methods. We knew how we thought they worked, and how we hoped they would, but it’s always worth checking. Time to write some quick code – here it is, on GitHub

And in code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88

package uk.ac.qmul.sbcs.evolution.sandbox;

/**
* Class to test the Float.parseFloat() method performance on text data
*
In particular odd strings which should be equal, e.g.
*
<ul>
<li>"0.01"</li>
<li>"00.01"</li>
<li>" 0.01" (note space)</li>
<li>"0.0100"</li>
</ul>
*

NB uses assertions to test - run JVM with '-ea' argument. The first three tests should pass in the orthodox manner. The fourth should throw assertion errors to pass.
* @author joeparker
*
*/
public class TextToFloatParsingTest {

/**
* Default no-arg constructor
*/
public TextToFloatParsingTest(){
/* Set up the floats as strings*/
String[] floatsToConvert = {"0.01","00.01"," 0.01","0.0100"};
Float[] floatObjects = new Float[4];
float[] floatPrimitives = new float[4];

/* Convert the floats, first to Float objects and also cast to float primitives */
for(int i=0;i<4;i++){
floatObjects[i] = Float.parseFloat(floatsToConvert[i]);
floatPrimitives[i] = floatObjects[i];
}

/* Are they all equal? They should be: test this. Should PASS */
/* Iterate through the triangle */
System.out.println("Testing conversions: test 1/4 (should pass)...");
for(int i=0;i<4;i++){
for(int j=1;j<4;j++){
assert(floatPrimitives[i] == floatPrimitives[j]);
assert(floatObjects[i] == floatPrimitives[j]);
}
}
System.out.println("Test 1/4 passed OK");

/* Test the numerical equivalent */
System.out.println("Testing conversions: test 2/4 (should pass)...");
for(int i=0;i<4;i++){
assert(floatPrimitives[i] == 0.01f);
}
System.out.println("Test 2/4 passed OK");

/* Test the numerical equivalent inequality. Should PASS */
System.out.println("Testing conversions: test 3/4 (should pass)...");
for(int i=0;i<4;i++){
assert(floatPrimitives[i] != 0.02f);
}
System.out.println("Test 3/4 passed OK");

/* Test the inversion */
/* These assertions should FAIL*/
System.out.println("Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...");
boolean test_4_pass_flag = false;
try{
for(int i=0;i<4;i++){
for(int j=1;j<4;j++){
assert(floatPrimitives[i] != floatPrimitives[j]);
assert(floatObjects[i] != floatPrimitives[j]);
test_4_pass_flag = true; // If AssertionErrors are thrown as we expect they will be, this is never reached.
}
}
}finally{
// test_4_pass_flag should never be set true (line 62) if AssertionErrors have been thrown correctly.
if(test_4_pass_flag){
System.err.println("Test 3/4 passed! This constitutes a logical FAILURE");
}else{
System.out.println("Test 4/4 passed OK (expected assertion errors occured as planned.");
}
}
}
public static void main(String[] args) {
// TODO Auto-generated method stub
new TextToFloatParsingTest();
}

}

If you run this with assertions enabled (‘/usr/bin/java -ea package uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest’) you should get something like:

Testing conversions: test 1/4 (should pass)...
Test 1/4 passed OK
Testing conversions: test 2/4 (should pass)...
Test 2/4 passed OK
Testing conversions: test 3/4 (should pass)...
Test 3/4 passed OK
Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...
Exception in thread "main" java.lang.AssertionError
at uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest.<init>(TextToFloatParsingTest.java:60)
at uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest.main(TextToFloatParsingTest.java:76)
Test 4/4 passed OK (expected assertion errors occured as planned.

Tweet this Digg Post to LinkedIn Slashdot Stumble This