Rapid, raw-read reference and identification (R4IDs): A flexible platform for rapid generic species ID using long-read sequencing technology.

Our paper on rapid identification of samples using partial, low-coverage, MinION-sequenced reference databases (demonstrated live at the Kew Science Festival) is now in preprint. See here on bioRxiv: doi: 10.1101/281048.

In it, we show (with empirical data and simulation) that the length and bias of MinION reads make them ideal for sample ID (better than NGS, under certain conditions), even where no reference assembly is available and a genomic skim is used as the BLAST database instead. Because these genome-skim databases are quick to generate, we call them ‘rapid-raw-read-reference for ID’, or ‘R4IDs’ for short.
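To make the idea of BLASTing query reads against several genome-skim databases concrete, here is a minimal Python sketch. It is not the pipeline from the paper: it assumes a deliberately simple rule (each read votes for the database that gives it the highest bit score, ties are discarded, and the sample is called by majority vote), and the database names, read names and hit lines are made up for illustration. The only fixed convention it relies on is the standard 12-column BLAST tabular output (`-outfmt 6`), where the first column is the query ID and the twelfth is the bit score.

```python
from collections import Counter, defaultdict


def best_bitscores(blast_tab_lines):
    """Return {read_id: highest bit score} from BLAST tabular (outfmt 6) lines."""
    best = {}
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        read_id, bitscore = fields[0], float(fields[11])
        if bitscore > best.get(read_id, 0.0):
            best[read_id] = bitscore
    return best


def assign_reads(hits_by_database):
    """Assign each read to the database with its best hit; tied reads are skipped."""
    scores = defaultdict(dict)  # read_id -> {database: best bit score}
    for database, lines in hits_by_database.items():
        for read_id, score in best_bitscores(lines).items():
            scores[read_id][database] = score

    assignments = {}
    for read_id, per_db in scores.items():
        ranked = sorted(per_db.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue  # ambiguous read: identical best scores in two databases
        assignments[read_id] = ranked[0][0]
    return assignments


def call_sample(hits_by_database):
    """Majority vote over per-read assignments; None if no read was assigned."""
    votes = Counter(assign_reads(hits_by_database).values())
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical hits for three query reads against two genome-skim databases.
example_hits = {
    "species_A": [
        "read1\trefA_1\t85.0\t1200\t150\t30\t1\t1200\t10\t1210\t0.0\t900",
        "read2\trefA_7\t80.0\t300\t50\t10\t1\t300\t5\t305\t1e-50\t200",
        "read3\trefA_2\t82.0\t600\t90\t18\t1\t600\t40\t640\t1e-120\t400",
    ],
    "species_B": [
        "read1\trefB_3\t78.0\t250\t45\t10\t1\t250\t1\t250\t1e-30\t150",
        "read2\trefB_9\t80.0\t300\t50\t10\t1\t300\t8\t308\t1e-50\t200",
    ],
}

example_assignments = assign_reads(example_hits)
example_call = call_sample(example_hits)
```

In this toy input, read2 scores identically in both databases and so casts no vote. A real analysis would probably want a more careful statistic than a raw best-hit vote, for example one that also takes alignment length into account.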

The code to repeat these analyses or set up an R4IDs analysis is on GitHub, but I’ve also packaged it as Docker containers: hub.docker.com/lonelyjoeparker

There are still a few rough corners to sand while we decide where to submit it, but comments are welcome in the meantime. Thanks, as ever, to my colleagues (long-suffering Alex and Andrew) for all their stress:

Rapid, raw-read reference and identification (R4IDs): A flexible platform for rapid generic species ID using long-read sequencing technology.

Joe Parker*, Andrew J. Helmstetter, Alexander S. T. Papadopulos*


The versatility of the current DNA sequencing platforms and the development of portable, nanopore sequencers means that it has never been easier to collect genetic data for unknown sample ID. DNA barcoding and meta-barcoding have become increasingly popular and barcode databases continue to grow at an impressive rate. However, the number of canonical genome assemblies (reference or draft) that are publicly available is relatively tiny, hindering the more widespread use of genome scale DNA sequencing technology for accurate species identification and discovery. Here, we show that rapid raw-read reference datasets, or R4IDs for short, generated in a matter of hours on the Oxford Nanopore MinION, can bridge this gap and accelerate the generation of useable reference sequence data. By exploiting the long read length of this technology, shotgun genomic sequencing of a small portion of an organism’s genome can act as a suitable reference database despite the low sequencing coverage. These R4IDs can then be used for accurate species identification with minimal amounts of re-sequencing effort (<1000s of reads). We demonstrated the capabilities of this approach with six vascular plant species for which we created R4IDs in the laboratory and then re-sequenced, live at the Kew Science Festival 2016. We further validated our method using simulations to determine the broader applicability of the approach. Our data analysis pipeline has been made available as a Dockerised workflow for simple, scalable deployment for a range of uses.
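
As a back-of-envelope illustration of why read length matters when the reference is only a low-coverage skim, the sketch below uses an idealised model that is not taken from the paper: reference reads are assumed to start uniformly at random along the genome, so the number of reference reads overlapping a given query read is approximately Poisson. A reference read of length Lr overlaps a query read of length Lq by at least m bases if it starts within a window of Lq + Lr - 2m positions, so the chance of at least one overlap is roughly 1 - exp(-N(Lq + Lr - 2m)/G) for N reference reads and genome size G. All numbers in the example are hypothetical.

```python
import math


def overlap_probability(n_ref_reads, ref_read_len, query_read_len,
                        genome_size, min_overlap=0):
    """Approximate chance that a query read overlaps at least one reference read.

    Idealised assumption: reference read start positions are uniform and
    independent, so overlaps follow a Poisson distribution.
    """
    window = ref_read_len + query_read_len - 2 * min_overlap
    if window <= 0:
        return 0.0  # reads too short to ever reach the minimum overlap
    expected_overlaps = n_ref_reads * window / genome_size
    return 1.0 - math.exp(-expected_overlaps)


# Hypothetical skim: 10,000 reference reads of 5 kb from a 1 Gb genome
# (about 0.05x coverage), queried with a short read and with a long read.
p_short_query = overlap_probability(10_000, 5_000, 150, 1_000_000_000)
p_long_query = overlap_probability(10_000, 5_000, 5_000, 1_000_000_000)
```

Under these assumptions the long query read is nearly twice as likely as the short one to land on something in the skim, and a usable match also requires the overlap to be long enough to align with confidence, which the `min_overlap` parameter only crudely represents.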