Tag Archives: bioinformatics

Tent-seq: the paper (aka ‘field-based, real-time phylogenomics’)

Really proud to report that the first of our bona fide real-time phylogenomics papers is now out in Scientific Reports!

In the paper we managed to show a number of things that are potentially really exciting, and I’ll get to them in a minute. First though, this is the first paper I’ve published where I got to drive every part: from conceiving the idea (with Alex) to getting funding, planning and carrying out the fieldwork/field-sequencing (with Alex and Dion), sequencing (all by Andrew) and analysis and writing (everyone). This was incredibly satisfying as normally a lot of my time is spent analysing downstream data. I feel like a proper grown-up scientist now. More please!

Firstly, what did we actually do? Pretty simple really:

  • Over a week (May 2016) in Snowdonia National Park, Wales,
  • We collected the flowering plant Arabidopsis thaliana and its congener A. lyrata,
  • Extracted their DNA and prepared sequencing libraries for MinION sequencing, in a tent with no mains power or running water,
  • Sequenced both species using Oxford Nanopore MinION, and
  • Analysed them in real-time with BLAST databases held locally, building trees with a handful of genes.

Later on back in the lab we repeated the sequencing (but not extractions) with Illumina MiSeq, so we could compare the platforms, and also developed a few more sophisticated bioinformatics analyses. To be honest, most of the pipelines we ran could have run in real-time (and now do) but at the time of the main fieldwork we just didn’t expect it would work as well as it did!

Result 1: Genomic DNA sequencing with MinION is fairly easy, even in the middle of nowhere.

Seriously, depending on how much patience and practical skill you have, this is either easy, or really easy. We used the Oxford Nanopore 1D Rapid sequencing kit (disclaimer: actually a prototype ONT provided, though the commercial off-the-shelf version is much better now) for sequencing, and extracted DNA using the Qiagen DNeasy Plant Mini kit, modified with a longer initial incubation and a double-concentration cleanup step, but essentially unchanged. The MinION itself, as is well documented, runs off USB into a laptop.

Hardware-wise you’ll need:

  • Two waterbaths (or in our case, two polyboxes, a gas kettle, and some thermometers)
  • A centrifuge
  • A generator to run said devices
  • Some poly boxes with -20ºC and ice for reagents

… that’s it. If you’re looking at this list and thinking ‘I could get all that together by next weekend, maybe we should go on a sequencing trip’ well, that’s the idea 🙂

There’s a lot of refinements possible. A portable freezer will make life easier, as will a dedicated 12v supply for USB power and a portable DNA quantification tool like a Quantus. Plus, all of the above don’t really like rain so a tent and/or van (as with Nick Loman and Josh Quick’s Zika trip last year) will help out a lot. But to get started, that’s it.

Result 2: Long MinION reads are really good at species ID – even better than Illumina in key respects.

The core goal of this project was to work out: "can field-based whole-genome sequencing identify closely-related species?". So we deliberately picked two species from the same genus with publicly-available reference genome sequences (A. thaliana and A. lyrata). The ID process would be simple (a command-line sketch follows the list). For each of the four datasets (two MinION runs, one from each species, and two MiSeq 2x300bp paired-end runs, one from each species), we'd simply:

  1. Trim adapters from each read
  2. Match each read to the A. thaliana genome using the best hit with BLASTN
  3. Match each read to the A. lyrata genome, using the same method
  4. Compare the hit scores for each reference genome
  5. Un-blind the read (reveal which of the two species it actually came from)
  6. Score the read as a true or false positive or negative, depending on the result.

Clearly many reads finding a BLAST alignment to one species will also find a significant alignment to the other species, since these are separated by only a couple of million years (and are pretty similar phenotypically). So if both hits are 'significant', how would we distinguish the better one? Intuitively it seems sensible that the longer match / greater number of identities will be better, but what threshold length difference should we use? 1bp longer? 10bp? 100bp?

Happily the ROCR package in R lets you investigate the performance of test statistics on known classifier sets. We used this to produce the plot below, which shows the effect of increasing the threshold length difference on true-positive (TP), false-positive (FP) and accuracy rates for MinION (black) and MiSeq (red) reads:

Figure_2c-2d_length

The really key thing here is that MinION reads beat MiSeq ones at most length-difference (bias) thresholds greater than ~5-10bp, right up to 300bp (the MiSeq inserts top out at this length, of course). This is important because, while here we're matching orthogonal ID cases (A. lyrata against A. thaliana, and vice versa), in a practical application we might have a third species without a reference but two possible matches, and while some loci will be closer to the first, others could match the second. So while a threshold of 1bp might technically be best (TP rate of ~90% and close to 100% accuracy), we may want to raise the threshold to a much higher value (>50bp) and accept a lower TP rate in exchange for greater confidence in each call.

Result 3: Species ID does not need complete genomes for reference databases, and even works with a handful of MinION reads.

One very obvious and sensible criticism of our early drafts of this was that the reference genomes we used to build BLASTN databases with are largely completely assembled. While there’s been some handwringing recently about the structural variation of these plant genomes at population level, most people accept that a high proportion of the informative sequences in these genomes are now well determined.

For most people, in most places, this will not be the case; for instance, there’s ~300,000-400,000 plant species described, but only ~180-250 public genomes. Most of those are of the fairly-low-coverage HTS WGS variety as well, so are pretty bittily assembled. Quite often these come from first-year baselining experiments in the ‘get some DNA and run it through the MiSeq then SOAP or Abyss’ mould, with N50 values in the low ‘000s.

So to test the effect of this, we artificially digested the reference genomes a few thousand times to simulate N50 values from about 100 (virtually unassembled) up to 10^6 (essentially complete), shown here in Fig 3a for N50 values from 10^0 up to 10^4, with accuracy scores calculated for a range of cutoff values:

fig_3
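
One crude way to mimic that kind of fragmented reference is simply to chop each chromosome into fixed-size pieces before rebuilding the BLAST database. The awk sketch below is illustrative only (the digestion in the paper was repeated a few thousand times to simulate a whole range of target N50 values, rather than using one fixed window); it chops a genome FASTA into 5kb chunks:

# Chop every sequence in a genome FASTA into 5,000bp fragments
awk -v W=5000 '
  /^>/ { header = substr($0, 2); next }       # remember the sequence name
  { seq[header] = seq[header] $0 }            # concatenate sequence lines
  END {
    for (h in seq) {
      s = seq[h]; n = length(s)
      for (i = 1; i <= n; i += W)
        printf(">%s_frag%d\n%s\n", h, int((i - 1) / W) + 1, substr(s, i, W))
    }
  }' A_thaliana.fasta > A_thaliana_fragmented.fasta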

These results were pretty promising, so finally we asked ourselves: OK, we had tens of thousands of MinION reads to make our ID with, generated over a day or so – but how few reads would we need to have a stab at a correct ID? Again, we jackknifed our dataset to find out (shown above in Figs 3b and 3c). Promisingly, you can see that by about 10^2-10^3 reads (in practice, an hour or less of sequencing) the confidence intervals on our ID score barely budge. So, after an hour of sequencing, you're likely to get as good an answer as you're going to get. One. Hour…!!!
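
The subsampling itself is easy to reproduce. Here's a minimal sketch, assuming the seqtk toolkit and illustrative file names, which draws random subsamples of increasing size for re-analysis:

# Draw random subsamples of n reads, five replicates each
for n in 10 100 1000 10000; do
  for rep in 1 2 3 4 5; do
    seqtk sample -s "$rep" reads.fastq "$n" > subsample_${n}_${rep}.fastq
    # ...then re-run the BLAST ID comparison on each subsample
  done
done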

Result 4: Field-sequenced WGS MinION long-reads substantially improve downstream genomics with low-coverage HTS data.

When planning the fieldwork we hadn't really known what we'd get, in terms of read length, quality, or yield: this was a prototype kit that not many people had played with yet, let alone taken into the field. But about the time we were writing this up we found out about various genome assemblers optimised (supposedly) for ONT reads, chiefly Canu and hybridSPAdes. We decided to give them a whirl… and the results are pretty amazing (there's a short assembly sketch after the table)!

Data                                  MiSeq only     MiSeq + MinION
Assembler                             ABySS          hybridSPAdes
Illumina reads, 300bp paired-end      8,033,488      8,033,488
Illumina data (yield)                 2,418 Mbp      2,418 Mbp
MinION reads, R7.3 + R9 kits
  (N50 ~ 4,410bp)                     –              96,845
MinION data (yield)                   –              240 Mbp
Approx. coverage                      19.49x         19.49x + 2.01x
Assembly key statistics:
  # contigs                           24,999         10,644
  Longest contig                      90 Kbp         414 Kbp
  N50 contiguity                      7,853 bp       48,730 bp
  Fraction of reference genome (%)    82             88
Errors, per 100 kbp:
  # N's                               1.7            5.4
  # mismatches                        518            588
  # indels                            120            130
Largest alignment                     76,935 bp      264,039 bp
CEGMA gene completeness estimate:
  # genes                             219 of 248     245 of 248
  % genes                             88%            99%
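
For reference, the hybrid assembly itself needs very little command-line work. A minimal sketch (illustrative file names and thread count; assumes a SPAdes version new enough to accept Nanopore reads via --nanopore, plus ABySS for the short-read-only comparison):

# Hybrid assembly: MiSeq paired-end reads plus MinION long reads
spades.py -1 miseq_R1.fastq -2 miseq_R2.fastq \
          --nanopore minion_reads.fastq \
          -o hybrid_assembly -t 8

# MiSeq-only comparison with ABySS (k-mer size chosen arbitrarily here)
abyss-pe name=miseq_only k=96 in='miseq_R1.fastq miseq_R2.fastq'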

Result 5: Individual MinION reads can be directly, individually annotated for coding loci with no assembly required.

By now everyone was getting a bit sick of me going on about MinION reads, but there was one final hunch I wanted to test: If reads are about the same length as nuclear coding loci (~5000-50,000bp), does that mean we can annotate individual reads to pull out coding sequences, and use them to build phylogenies? SNAP was a great tool for this, not least because it’s trained on A. thaliana gene models already.

I want to be absolutely clear here, as sometimes people seem to miss this: I’m not talking about assembling reads before annotation as usual. I’m not even talking about assembling them in real-time, then annotating. I mean, each time a read finishes basecalling, immediately try and annotate that single read, and only that single read, to try and get a coding sequence. 
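
In pipeline terms that just means watching the basecaller's output folder and firing the annotator at every new read as it lands. Here's a minimal sketch, assuming SNAP is on the PATH with its bundled A.thaliana.hmm parameter file, that reads arrive one per FASTA file, and with illustrative paths throughout:

# Annotate each read individually, as soon as its FASTA file appears
HMM=/path/to/snap/HMM/A.thaliana.hmm
WATCH_DIR=reads_fasta
OUT_DIR=per_read_cds
mkdir -p "$OUT_DIR"

while true; do
  for read_fa in "$WATCH_DIR"/*.fasta; do
    [ -e "$read_fa" ] || continue                    # no reads yet
    base=$(basename "$read_fa" .fasta)
    [ -e "$OUT_DIR/$base.gff" ] && continue          # already annotated
    # -gff writes gene models to stdout; -tx writes the predicted transcripts
    snap -gff -tx "$OUT_DIR/$base.cds.fasta" "$HMM" "$read_fa" \
      > "$OUT_DIR/$base.gff"
  done
  sleep 10    # poll every 10 seconds
done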

In other words, how quickly can we turn a tube of DNA, into a folder of sequence reads, into alignments of coding loci? The answer is, ‘bloody quickly’:

Figure_2e_genes

The dashed line shows 'all gene models'. The solid line shows unique CDS. The y-axis is the number of predicted CDS and yes, those are thousands – recall that the total number of CDS for A. thaliana is only about 23,000. The x-axis is, well, actual sequencing time (!)*

Now, not all of these are complete genes, and the error rate means distinguishing paralogs robustly in a real case (e.g. a completely novel genome) would be tricky, but on the other hand this was a completely unoptimised pipeline, really just hacked together over a couple of weeks. There's a lot of scope to improve this…

*These are the read timestamps, but it wasn’t a live run. I actually ran the analysis back in the lab afterwards, as my code was too buggy on the fieldwork day and I just lost my shit. But the CPU demands aren’t high – I can and have run this live subsequently.

 

Result 6: Predicted coding loci from individual MinION reads can be aligned to orthologous sequences, and multilocus phylogeny inferred in real-time.

I build trees. I build trees. I build trees for a living. Did you seriously think, that having got as far as spitting out thousands of novel coding loci per day in field-based sequencing, I wasn’t going to try some real-time, field-based, multilocus phylogenomic inference?

Figure_X_tree_hybridSPAdes

Here’s a *BEAST tree from 53 loci, all of the A. thaliana ones coming from directly-annotated, field-sequenced reads. Seriously 😉

Summary

As you’ve gathered by now, I’m enormously happy with this research. I think this paper is easily a bigger contribution to general science than our 2013 molecular convergence one because, if we’re honest, molecular convergence is an interesting but niche phenomenon, whereas literally everyone who uses or categorises any kind of biological material can benefit from this work.

I’m indebted to my colleagues, Alex S. T. Papadopulos, Dion Devey, Andrew Helmstetter and Tim Wilkinson, and our funders, the Kew Foundation. We didn’t invent the MinION – that took hundreds of incredibly clever people at Oxford Nanopore years and millions of pounds of investment – but in this study we’ve managed to show all of the really transformative aspects of this technology working in the field, in real time. There is no technical reason at all why we shouldn’t all expect that, within a decade, all of the analyses we currently run on DNA data in labs will run in the field, within minutes of collecting biological samples. And that really is something.

Inference and informatics in a ‘sequenced’ world

Short lecture relating my recent work on real-time phylogenomics, implications for bioinformatics research and future directions of genomic/phylogenetic modelling to explicitly account for phylogeny, synteny and identity through coloured graphs.

University of Reading, 2nd August 2017

Slides [SlideShare]: cc-by-nd

[slideshare id=78587606&doc=2017readingbioinfforgenomics-joeparker-final3-170805084405]

Using field-based DNA sequencing to accelerate phylogenomics

Invited seminar at the Department of Zoology, Oxford University, 30th November 2016.

Summary of our field-based real-time phylogenomics (MinION DNA sequencing) experiments this year, and applicability to broad-scale tree-of-life phylogenomics and macroevolutionary biology.

Slides [SlideShare]: cc-by-nd

[slideshare id=69767351&doc=2016oxfordzoojoeparker-161202163931]

Copying LOADS of files from a folder of LOADS *AND* LOADS more in OSX

Quick one this, as it’s a tricky problem I keep having to Google/SO. So I’m posting here for others but mainly myself too!

Here’s the situation: you have a folder (with, ooh, let’s say 140,000 separate MinION reads, for instance…) which contains a subset of files you want to move or copy somewhere else. Normally, you’d do something simple like a wildcarded ‘cp’ command, e.g.:

host:joeparker$ cp dir/*files_I_want* some/other_dir

Unfortunately, if the list of files matched by that wildcard is sufficiently long (more than a few thousand), you’ll get an error like this:

-bash: /bin/cp: Argument list too long

In other words, you’re going to have to be more clever. Using the GUI/Finder usually isn’t an option either at this point, as the directory size will likely defeat Finder, too. The solution is pretty simple but takes a bit of tweaking to work in OSX (full credit to posts here and here that got me started).

Basically, we’re going to use the ‘find’ command to locate the files we want, then pass each one in turn as an argument to ‘cp’ using ‘find -exec’. This is a bit slower overall than the original wildcarded command would have been, but since that won’t work we’ll have to lump it! The command* is:

find dir -maxdepth 1 -name '*files_I_want*' -exec cp {} some/other_dir \;

Simple, eh? Enjoy 🙂

*NB, In this command:

  • dir is the filesystem path to start the search; ‘find’ will recursively traverse the directory tree including and below this folder;
  • -name '*glob*' gives the filename pattern to match (quote it so the shell doesn’t expand the wildcard itself);
  • -maxdepth is how deep to recurse (e.g. 1 = ‘don’t recurse’);
  • cp is the command we’re executing on each found file (could be mv etc);
  • {} is the placeholder that -exec substitutes with each found file;
  • some/other_dir is the destination argument to the command invoked by -exec
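
An alternative that avoids spawning one ‘cp’ per file is to batch the copies through xargs. A sketch (OSX-specific: it relies on BSD xargs’ -J option for argument placement, and uses -print0/-0 so filenames with spaces survive):

# Batch the copies through xargs instead of one cp per file
find dir -maxdepth 1 -name '*files_I_want*' -print0 | \
  xargs -0 -J {} cp {} some/other_dir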

How to fake an OSX theme appearance in Linux Ubuntu MATE

I’ve recently been fiddling about and trying to fake an OSX-style GUI appearance in Linux Ubuntu MATE (15.04). This is partly because I prefer the OSX GUI (let’s be honest) and partly because most of my colleagues are mainly Mac users too (bioinformaticians…) and students in particular fear change! The Mac-style appearance seems to calm people down. A bit.

The specific OS I’m going for is 10.9 Mavericks, because it’s my current favourite and nice and clear. There are two main things to set up: the OS itself and the appearance. Let’s take them in turn.

1. The OS

I’ve picked Ubuntu (why on Earth wouldn’t you?!) and specifically the MATE distribution. This has a lot of nice features that make it fairly Mac-y, and in particular the windowing and package management seem smoother to me than the vanilla Ubuntu Unity system. Get it here: https://ubuntu-mate.org/vivid/.* The installation is pretty painless on Mac, PC or an existing Linux system. If in doubt you can use a USB as the boot volume without affecting existing files; with a large enough partition (the core OS is about 1GB) you can save settings – including the customisation we’re about to apply!

*We’re installing the 15.04 version, not the newest release, as 15.04 is an LTS (long-term support) Ubuntu distribution. This means it’s supported officially for a good few years yet. [Edit: Arkash (see below) kindly pointed out that 14.04 is the most recent LTS, not 15.04. My only real reason for using 15.04 therefore is ‘I quite like it and most of the bugs have gone'(!)]

2. The appearance

The MATE windowing system is very slick, but the green-ness is a bit, well, icky. We’re going to download a few appearance mods (themes, in Ubuntu parlance) which will improve things a bit. You’ll need to download these to your boot/install USB:

  • The ‘Ultra-Flat Yosemite Light’ theme (as a .zip archive)
  • The ‘OSX-MATE’ theme (from its GitHub repository)
  • The Solarized theme for the MATE Terminal (solarized-mate-terminal)

Boot the OS

Now that we’ve got everything we need, let’s boot up the OS. Insert the USB stick into your Mac-envious victim of choice, power it up and enter the boot device menu (F12 in most cases) before the existing OS loads. Select the USB drive as the boot volume and continue.

Once the Ubuntu MATE session loads, you’ll have the option of trialling the OS from the live USB, or permanently installing it to a hard drive. For this computer I won’t be installing to a hard drive (long story) but using the USB, so customising that. Pick either option, but beware that customisations to the live USB OS will be lost should you later choose to install to a hard drive.

When you’re logged in, it’s time to smarten this baby up! First we’ll play with the dock a bit. From the top menu bar, select “System > Preferences > MATE Tweak” to open the windowing management tool. In the ‘Interface’ menu, change Panel Layouts to ‘Eleven’ and Icon Size to ‘Small’. In the ‘Windows’ menu, we’ll change Buttons Layout to ‘Contemporary (Left)’. Close the MATE Tweak window to save. This is already looking more Mac-y, with a dock area at the bottom of the screen, although the colours and icons are off.

Now we’ll apply some theme magic to fix that. Select “System > Preferences > Look and Feel > Appearance” to open the customisation panel. Firstly, we’ll load both the ‘Ultra-Flat Yosemite Light’ and ‘OSX-MATE’ themes, so they’re available to our hybrid theme. Click the ‘Install..’ icon at the bottom of the theme selector and you’ll be able to select and install the Ultra-Flat Yosemite Light theme we downloaded above. It should unpack from the .zip archive and appear in the themes panel. Installing the OSX-MATE theme is slightly trickier (a terminal sketch follows the list):

  • Unzip (as sudo) the OSX-MATE theme to /usr/share/themes
  • Rename it from OSX-MATE-master to OSX-MATE if you downloaded it from git as a whole repository (again, you’ll need to sudo)
  • Restart the appearances panel and it should now appear in the themes panel.
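
In terminal form, those three steps look something like this (the archive name is an assumption, based on GitHub’s usual ‘Download ZIP’ naming):

# Unpack the theme into the system themes folder, then rename it
sudo unzip OSX-MATE-master.zip -d /usr/share/themes
sudo mv /usr/share/themes/OSX-MATE-master /usr/share/themes/OSX-MATE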

We’ll create a new custom theme with the best bits from both themes, so click ‘Custom’ theme, then ‘Customise..’ to make a new one. Before you go any further, save it under a new name! Now we’ll apply changes to this theme. There are five customisations we can apply: Controls, Colours, Window Border, Icons and Pointer:

  • Controls: Ultra-Flat Yosemite Light
  • Colours: There are eight colours to set here. Click each colour box then in the ‘Colour name’ field, enter:
    • Windows (foreground): #F0EAE7 / (background): #0F0F0E
    • Input boxes (fg): #FFFFFF / (bg): #0F0F0E
    • Selected items (fg): #003BFF / (bg): #F9F9F9
    • Tooltips: (fg): #2D2D2D / (bg): #DEDEDE
  • Window border: OSX-MATE
  • Icons: Fog
  • Pointer: DMZ (Black)

Save the theme again, and we’re done! Exit Appearance Preferences.

Finally, we’ll install Solarized as the default terminal (command-line interface) theme, because I like it. In the MATE Terminal, unzip the solarized-mate-terminal archive as sudo, enter the directory and run the install script with bash:


$ sudo unzip solarized-mate-terminal
$ cd solarized-mate-terminal
$ bash solarized-mate.sh

Close and restart the terminal. Hey presto! You should now be able to see the light/dark Solarized themes available, under ‘Edit > Profiles’. You’ll want to set one as the default when opening a new terminal.

Finally…

Later, I also installed Topmenu, a panel applet that gives an OSX-style, top-anchored application menu to some Linux programs. It’s a bit cranky and fiddly though, so you might want to give it a miss. But if you have time on your hands and really need that Cupertino flash, be my guest. I hope you’ve had a relatively easy install for the rest of this post, and if you’ve got any improvements, please let me know!

Happy Tweaking…

Messing about with the MinION

IMG_4543

Molecular phylogenetics – uncovering the history of evolution using signals in organisms’ genetic sequences – is a powerful science, the latest expression of the human desire to understand our common origins. But for all its achievements, I’d always felt something was, well, lacking from my science. This week we’ve been experimenting with the MinION for the first time, and now I know what that is: immediacy. I’ll return to this theme later to explain how exciting this realisation is, but first we’d better ask:

What is a MinION?

NGS nerds know about this already, of course, but for the rest of us: the MinION (minh-aye-on) is a USB-connected device marketed by a UK company, Oxford Nanopore, as ‘a portable real-time biological analyser’. Yes – that does sound a lot like a tricorder, and for good reason: just like the fictional device, it promises to make the instant identification of biological samples a reality. It does this using a radically different way to ‘read’ DNA sequences from a liquid sample into letters (‘A, C, G, T‘) on a computer screen. Essentially, individual strands of DNA are pulled through a hole (a ‘nanopore’) in an artificial membrane, like the membranes that surround every living cell. When an electric field is applied across the membrane, the individual DNA letters (actually ‘hexamers’ – 6-letter chunks) passing through the nanopore can be detected directly as fluctuations in the field, much like a C-60 cassette tape being read by a magnetic tape head. The really important thing is that, whereas other existing DNA-reading (‘sequencing’) machines take days to weeks to read a sample, the MinION produces output in minutes or even seconds. It’s also (as you can see in the picture above) a small – wait, tiny – device.

Real-time

This combination of fast results and small size makes the tricorder dream possible, but this is more than just a gadget. Really, having access to biological sequences at our fingertips will completely change biology and society in many ways, some of which Yaniv Erlich explored in a recent paper. In particular, molecular evolutionary biologists will soon be able to interact with their subject matter in ways that other scientists have long taken for granted. What I mean by this is that in many other empirical disciplines, researchers (and schoolkids!) are able to directly observe, or easily measure, the phenomena they study. Paleontologists dig bones. Seismologists feel earthquakes. Zoologists track lions. And so on. But until now, the world of genomic data hasn’t been directly observable. In fact reading DNA sequences is a slow, expensive pain in the arse. And in such a heavily empirical subject, I can’t help but feel this is a hindrance – we are burdened with a galaxy of baroque models to explain variations in the small number of observations we’ve made, and I’ve got a hunch more data would actually consign many of them to the dustbin.

Imagine formulating, testing, and validating or discarding genetic hypotheses as seamlessly as an ecologist might survey a new forest…

IMG_4531

Actual use

That’s the spiel, anyway. This week we actually used the MinION for the first time to sequence DNA (I’d been running some other technical tests for a couple of months, but this was our first attempt with real samples) and since it seems lots more colleagues want to ask about this device than have had a chance to use it, I thought I’d share our experiences. I say ‘we’ – work took place thanks to a Pilot Study Fund grant at the Royal Botanic Gardens, Kew, in collaboration with Dr. Alex Papadopulos (and really useful input from Drs. Andrew Helmstetter, Pepijn Kooij and Bryn Dentinger). It’s worth pointing out that although there’s a lot of hype about the device, it is still technically in a prototype/public-beta type phase, so things are changing all the time in terms of performance, etc.

IMG_4547

First up, the size. The pictures don’t really do it justice… compared with existing machines (often fridge-sized) the MinION isn’t just small, it’s tiny. The thing itself, plus box and cable, would easily fit into a side pocket on any laptop bag. So the ‘portable’ bit is certainly true, as far as the platform itself is concerned. But there’s a catch – to prepare a biological sample for sequencing on the MinION, you first have to go through a fairly involved set of lab steps. Known collectively as ‘library preparation’ (a library here meaning a test tube containing a set of DNA molecules that have been specially treated to ready them for sequencing), the existing lab protocol took us several hours the first time. Partly that was due to the sheer number of curious onlookers, but partly because some of the requisite steps (bead cleanups etc.) just need time and concentration to get right. None of the steps is particularly complicated (I’m crap at labwork and just about followed along) but there are quite a few, and you have to follow the protocol carefully.

Performance

So how did the MinION do? We prepared two libraries; one from a control sample of bacteriophage lambda (a viral genome, used to check the lab steps are working) and another from Silene latifolia, a small flowering plant and one of Alex’s faves. The results were exciting. In fact they were nerve-wracking – initially we mixed up the two samples by mistake (incredible – what pros…) and were really worried when, after a few hours’ sequencing, not a single DNA read matching the lambda genome had appeared. After a lot of worrying, and restarting various analyses, including the MinION itself, we eventually realised our mistake, reloaded the MinION with the other sample and – hey presto! – lambda reads started to pour out of the software. In the end, we were able to get a 500x coverage really quickly (see reads mapped to the reference genome, below):

lambda

You can see from the top plot that we got good even coverage across the genome, while the bottom plot shows an even (ish) read length distribution, peaking around 4kb. We’d tried for a target size of 6kb (shearing using g-TUBEs), so it seems the actual output read size distribution is lower than the shearing target – you’d need to aim higher to compensate. Still, we got plenty of reads longer than 20, or even 30,000 base-pairs (bp) – from a 47kbp genome this is great, and much much much longer than typical Illumina paired-end reads of ~hundreds of bp.

Later on the next day, we decided we’d done enough to be sure the library preparation and sequencing were working well, so we switched from the lambda control to our experimental (S. latifolia) sample. Again, we got a good steady stream of reads, and some were really long – over 65,000bp. Crucially, we were able to BLAST these against the NCBI public sequence database and get Silene hits straight back. We were also able to map them onto the Silene genome directly using BWA, even with no trimming or masking of low-quality regions. Overall, we got nearly a full week’s sequencing from our flowcell, with ~24,000 reads at a mean length of ~4kb. Not bad, and we’d have got more if we hadn’t wasted the best part of the run sodding about while we troubleshot our library mix-up at the start (the number of active pores, and hence sequencing throughput, declines over time).
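
For anyone wanting to try the same sort of checks, the command lines are short. A rough sketch (illustrative file names, not the exact commands we ran; assumes BLAST+ with remote searching enabled, and a bwa version with the Oxford Nanopore preset for bwa mem):

# Quick identity check of reads against NCBI's nt database (remote search)
blastn -query minion_reads.fasta -db nt -remote -outfmt 6 \
       -max_target_seqs 5 > blast_hits.tsv

# Map all reads against the Silene reference with the ONT preset
bwa index silene_reference.fasta
bwa mem -x ont2d silene_reference.fasta minion_reads.fastq | \
  samtools sort -o minion_vs_silene.bam -
samtools index minion_vs_silene.bam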

The bad: there is a price for these long, rapid reads. Firstly, the accuracy is definitely lower than Illumina – although the average sequencing accuracy in our first (lambda) library was nearly 90%, on some of our mapped Silene reads it dropped much lower, to 70% or so. This has been well documented, of course. Secondly, the *big* difference between the MinION in use and a MiSeq or HiSeq is the sheer level of involvement needed to get the best from the device – whereas both Illumina machines are essentially push-button operated, the MinION seems to respond well if you treat it kindly (reloading etc.) and not if you don’t (introducing air bubbles while loading seemed to wreck some pores).

The very, very good: ultimately though, we came away convinced that these issues won’t matter at all in the long run. ONT have said that the library prep and accuracy are both being improved, as is flowcell quality (prototype, remember). Most importantly, the MinION isn’t really a direct competitor to the HiSeq. It’s a completely different instrument. Yes, they both read DNA sequences, but that’s where the similarity ends. The MinION is just so flexible, there’s almost an infinite variety of uses.

In particular, we were really struck that the length of the reads means simple algorithms like BLAST can get a really good match from any sample with just a few hundred, or even a few tens, of reads. You just don’t need millions of 150bp reads to match an unknown sample to a database with reads this long! Couple that with the fact that the software is real-time and flowcells can be stopped and started, and you have the bones of a really capable genetic identification system for all sorts of uses: disease outbreaks (see this great work on Ebola using MinION), customs control of endangered species, agriculture, brewing – virtually anything, in fact, where finding out about living organisms’ identity or function is needed.

The future.

There’s absolutely no question at all: these devices are the future of biology. Maybe the MinION will take off (right now, the buzz couldn’t be hotter), or a competitor will find a way to do even more amazing things. It doesn’t matter how it comes about, though – within 10, 15, 20 years at most, the next generation of biologists will be living in a world where the sheer ubiquity of sequencers like this means that most, if not all, eukaryote genomes will have been sequenced, and so can easily be matched against an unknown sample. If that sounds fantastic, consider that the cumulative sequencing output of the entire world in 1988 amounted to a little over 100,000 sequences – a yield equivalent to a single good MinION run. And while most people in 1988 thought that, since the human genome might take 30 or 40 years to sequence, there was little point in even starting, a few others looked around at the new technology and its potential, and drew a different conclusion. The rest is history…

Real-time Phylogenomics

General science talk about the potential of real-time phylogenomics, delivered at the Jodrell Lecture Theatre, Kew Gardens, November 2nd 2015

Slides [SlideShare]: cc-by-nc-nd

[slideshare id=54651010&doc=real-time-phylogenomics-joeparker-151102162613-lva1-app6892]

Omics in extreme Environments (Lightweight bioinformatics)

Presentation on lightweight bioinformatics (Raspi / cloud computing) for real-time field-based analyses.

Presented at iEOS2015, St. Andrews, 3-6th July 2015.

Slides [SlideShare]: cc-by-nc-nd

[slideshare id=50251856&doc=joeparkerlightweightbioinformatics-150707112254-lva1-app6892]

Parsing numbers from multiple formats in Java

We were having a chat over coffee today and a question arose about merging data from multiple databases. At first sight this seems pretty easy, especially if you’re working with relational databases that have unique IDs (like, uh, a Latin binomial name – Homo sapiens) to hang records from… right?

But, oh no.. not at all. One important reason is that seemingly similar data fields can be extremely tricky to merge. They may have been stated with differing precision (0.01, 0.0101, or 0.01010199999?), be encoded in different data types (text, float, blob, hex etc) or character set encodings (UTF-8 or Korean?) and even after all that, refer to subtly different quantities (mass vs weight perhaps). Who knew database ninjas actually earnt all that pay.

So it was surprising, but understandable, to learn that a major private big-data user (unnamed here) stores pretty much everything as text strings. Of course this solves one set of problems nicely (everyone knows how to parse/handle text, surely?) but creates another. That’s because it is trivially easy to code the same real-valued number in multiple different text strings – some of which may break sort algorithms, or even memory constraints. Consider the number ‘0.01’: as written there’s very little ambiguity for you and me. But what about:

“0.01”,
“00.01”,
“ 0.01” (note the space),
or even “0.01000000000”?

After a quick straw poll, we also realised that none of us was completely sure how our most-used programming languages (Java for me; Perl, Python etc. for others) performed this conversion in their native string-to-float methods. We knew how we thought they worked, and how we hoped they would, but it’s always worth checking. Time to write some quick code – here it is, on GitHub

And in code:

package uk.ac.qmul.sbcs.evolution.sandbox;

/**
 * Class to test the behaviour of the Float.parseFloat() method on text data,
 * in particular odd strings which should be equal, e.g.:
 * <ul>
 *     <li>"0.01"</li>
 *     <li>"00.01"</li>
 *     <li>" 0.01" (note space)</li>
 *     <li>"0.0100"</li>
 * </ul>
 * NB uses assertions to test - run the JVM with the '-ea' argument. The first three
 * tests should pass in the orthodox manner. The fourth should throw assertion errors
 * in order to pass.
 *
 * @author joeparker
 */
public class TextToFloatParsingTest {

    /**
     * Default no-arg constructor
     */
    public TextToFloatParsingTest(){
        /* Set up the floats as strings */
        String[] floatsToConvert = {"0.01", "00.01", " 0.01", "0.0100"};
        Float[] floatObjects = new Float[4];
        float[] floatPrimitives = new float[4];

        /* Convert the floats, first to Float objects, then cast to float primitives */
        for(int i=0; i<4; i++){
            floatObjects[i] = Float.parseFloat(floatsToConvert[i]);
            floatPrimitives[i] = floatObjects[i];
        }

        /* Are they all equal? They should be: test this. Should PASS */
        /* Iterate through the triangle of pairwise comparisons */
        System.out.println("Testing conversions: test 1/4 (should pass)...");
        for(int i=0; i<4; i++){
            for(int j=1; j<4; j++){
                assert(floatPrimitives[i] == floatPrimitives[j]);
                assert(floatObjects[i] == floatPrimitives[j]);
            }
        }
        System.out.println("Test 1/4 passed OK");

        /* Test against the numerical equivalent. Should PASS */
        System.out.println("Testing conversions: test 2/4 (should pass)...");
        for(int i=0; i<4; i++){
            assert(floatPrimitives[i] == 0.01f);
        }
        System.out.println("Test 2/4 passed OK");

        /* Test the numerical inequality. Should PASS */
        System.out.println("Testing conversions: test 3/4 (should pass)...");
        for(int i=0; i<4; i++){
            assert(floatPrimitives[i] != 0.02f);
        }
        System.out.println("Test 3/4 passed OK");

        /* Test the inversion: these assertions should FAIL */
        System.out.println("Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...");
        boolean test_4_pass_flag = false;
        try{
            for(int i=0; i<4; i++){
                for(int j=1; j<4; j++){
                    assert(floatPrimitives[i] != floatPrimitives[j]);
                    assert(floatObjects[i] != floatPrimitives[j]);
                    test_4_pass_flag = true; // If AssertionErrors are thrown as we expect, this is never reached.
                }
            }
        }finally{
            // test_4_pass_flag should never have been set true if AssertionErrors were thrown correctly.
            if(test_4_pass_flag){
                System.err.println("Test 4/4 passed! This constitutes a logical FAILURE");
            }else{
                System.out.println("Test 4/4 passed OK (expected assertion errors occurred as planned).");
            }
        }
    }

    public static void main(String[] args) {
        new TextToFloatParsingTest();
    }

}


If you run this with assertions enabled (‘/usr/bin/java -ea uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest’) you should get something like:

Testing conversions: test 1/4 (should pass)...
Test 1/4 passed OK
Testing conversions: test 2/4 (should pass)...
Test 2/4 passed OK
Testing conversions: test 3/4 (should pass)...
Test 3/4 passed OK
Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...
Exception in thread "main" java.lang.AssertionError
    at uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest.<init>(TextToFloatParsingTest.java:60)
    at uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest.main(TextToFloatParsingTest.java:76)
Test 4/4 passed OK (expected assertion errors occurred as planned).