Inference and informatics in a ‘sequenced’ world

Short lecture relating my recent work on real-time phylogenomics, implications for bioinformatics research and future directions of genomic/phylogenetic modelling to explicitly account for phylogeny, synteny and identity through coloured graphs.

University of Reading, 2nd August 2017

Slides [SlideShare]: cc-by-nd

Posted in Publications, Talks

What is ‘real-time’ phylogenomics?

Over the past few years I’ve been developing a body of research which I collectively refer to as ‘real-time phylogenomics’ – this is also the name of our mini-site for MinION-based rapid identification-by-sequencing. Since our paper on this will hopefully be published soon, it’s probably worth defining now what I hope this term denotes, what it does not, and ultimately where I hope this research is going.

‘Phylogenomics’ is simple enough, and Jonathan Eisen at UC Davis has been a fantastic advocate of the concept. Essentially, phylogenomics is scaled-up molecular systematics, with datasets (usually derived from a genome annotation and/or transcriptome) comprising many coding loci rather than a few genes. ‘Many’ in this case usually means hundreds, or thousands, so we’re typically looking at primarily nuclear genes, although organelles’ genomes may often be incorporated, since they’re usually far easier to reliably assemble and annotate. The aim is, basically, to average phylogenetic signal over many loci by combining gene trees (or an analogous approach) to try and obtain phylogenies with higher confidence, since single- or few-locus approaches (including barcodes, no matter how judiciously chosen) are capable of producing incorrect trees with high confidence. The process is intensive, since genomes must be sequenced and then assembled to a sufficient standard to be reasonably certain of identifying orthologous loci. This isn’t the only use of the term (which also refers to phylogenies produced from whole-genome metagenomics) but it is the most straightforward and common one as far as eukaryote genomics is concerned, and certainly the one uppermost in my mind.

However, the results are often confusing, or at least more complex than we might hope: instead of a single phylogeny with high support from all loci, and robust to the model used, we often find a high proportion of gene trees (10-30%, perhaps) agree with each other, but not with the modal (most common, e.g. majority rule consensus) tree topology. For instance, among 2,326 loci in our 2013 paper on phylogenomics of the major bat families, we found that the position of a particular group of echolocators – which had been hotly debated for decades, based on morphological and single-locus approaches – showed such a pattern (sometimes supporting the traditional grouping of Microchiroptera + Megachiroptera, but over 60% of loci supporting the newer Yangochiroptera + Yinpterochiroptera system). This can be for a variety of reasons, some biological and some methodological. The point is that we have a sufficiently detailed picture to let us choose between competing phylogenetic hypotheses with both statistical confidence and intuition based on comparison.
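
To make the idea of a ‘modal’ topology concrete, here is a minimal Python sketch that tallies which topology each locus supports and reports the share of loci behind each one. The topology labels and counts are invented for illustration and are not the figures from the bat paper; a real analysis would also need to compare actual tree topologies (with a phylogenetics library) rather than rely on pre-assigned labels, which is assumed away here.

```python
from collections import Counter


def topology_support(gene_tree_labels):
    """Return (topology, proportion) pairs, most common topology first."""
    counts = Counter(gene_tree_labels)
    total = sum(counts.values())
    return [(topology, n / total) for topology, n in counts.most_common()]


# Hypothetical per-locus results: each entry names the topology one gene tree supports.
labels = ["A"] * 65 + ["B"] * 25 + ["C"] * 10

support = topology_support(labels)
modal_topology, modal_share = support[0]

for topology, share in support:
    print(f"{topology}: {share:.0%} of loci")
```

In this toy case topology A is the modal tree, but a substantial minority of loci agree with each other on B, which is the kind of pattern described above.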

These techniques have been on the horizon for a while (certainly since at least 2000) and gathered pace over the last decade with improvements in computing, informatics, and especially next-generation sequencing. The other half of this equation, ‘real-time’ sequencing, has emerged much more recently and centres, obviously, on the MinION sequencer. Most work using this so far has focused either on the very impressive potential long-read data offers for genomic analyses, particularly assembly, or rapid ID of samples e.g. the Quick/Loman Zika and Ebola monitoring studies; and our own work.

So what, exactly, do we hope to achieve with phylogenomic-type analyses using real-time MinION data, and why?

Well, firstly, our work so far has shown that the existing pipeline (sample -> transport -> sequence -> assemble genome -> annotate genes -> align loci -> build trees) has lots of room for speedups, and we’re fairly confident that the inevitable tradeoff with accuracy when you omit or simplify certain steps (laboratory-standard sequencing, assembly) is at least compensated for by the volume of data alone. Recall that a ‘normal’ phylogenomic tree similar to our bat one might take two or more postdocs/students a year to generate from biological samples, often longer. A process taking a week instead would let you generate something like 50 more analyses in a year! The most obvious application for this is just accelerating existing research, but the potential for transforming fieldwork and citizen science is considerable. This is because you can build trees that inform species relationships, even if the species in question isn’t known. In other words, a phylogenome can both reliably identify an unknown sample, and also identify if it is a new species.

More excitingly, I think we need to have a deeper look at how we both construct and analyse evolutionary models. Life on earth is best described, accurately and fully, by a network rather than a bifurcating tree, and this applies to loci as well as single genes. In other words, there is a single network that connects every locus in every living thing. Phylogenetic trees are only a bifurcating projection of this, while single- or multi-locus networks only comprise a part.

We’ve hitherto ignored this fact, largely because (a) trees are often a good approximation, especially in the case of eukaryote nuclear genes, and (b) the data and computation requirements a ‘network-of-life’ analysis implies are formidable. However, cracks are beginning to appear, in both faces. Firstly, many loci are subject to real biological phenomena (horizontal gene transfer, selection leading to adaptive convergence, etc) which give erroneous trees as discussed above. Meanwhile prokaryotic and viral inference is rarely even this straightforward. Secondly, expanding computing power, algorithmic complexity, and sequencing capacity (imagine just 1,000 high schools across the world, regularly using a MinION for class projects…) mean the question for us today really isn’t ‘how do we get data’, but ‘how ambitious do we want to be with it?’

Since my PhD but especially since 2012, I’ve been developing this theme. Ultimately I think the answer lies in the continuous analysis of public phylogenomic data. Imagine multiple distributed but linked analyses, continuously running to infer parts of the network of life, updating their model asynchronously both as new data flood in, and by exchanging information with each other. This is really what we mean by real-time phylogenomics – nothing less than a complete Network of Life, living in the cloud, publicly available and collaboratively and continuously inferred from real-time sequence data.

So… that’s what I plan to spend the 2020s doing, anyway.


Posted in Science

Some aspects of BLASTing long-read data

Quick note to explain some of the differences we’ve observed working with long-read data (MinION, PacBio) for sample ID via BLAST. I’ll publish a proper paper on this, but for now:

  • Long reads aren’t just a bit longer than Illumina data, but two, three, four or possibly even five orders of magnitude longer (up to 10^6 already, vs 10^2). This is mathematically obvious, but extremely important…
  • … the massive length means lots of the yield is in comparatively few reads. This makes yield stats based on numbers of reads positively useless for comparison with NGS. Also…
  • Any given long read contains significantly more information than a short one does. Most obviously the genomics facilities of the world have focused on their potential for improving genome assembly contiguity and repeat spanning (as well as using synteny to spot rearrangements etc) but we’ve also shown (Parker et al, submitted) that whole coding loci can be directly recovered from single reads and used in phylogenomics without assembly and annotation. This makes sense (a ~kb long read can easily span a whole gene, also ~kb in scale) but it certainly wasn’t initially obvious, and given error rates, etc, it’s surprising it actually works.
  • Sample ID using BLAST actually works very differently. In particular, the normal ‘rubbish in, rubbish out’ rule is inverted. In other words, nanopore reads (for the time being) may be long, but inevitably contain errors. However, this length means that, assuming BLAST database sequences are approximately as long/contiguous, nanopore queries tend to either match database targets correctly, with very long alignments (hundreds/thousands of identities), or not at all.
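
The point above about yield sitting in comparatively few reads can be illustrated with a short Python sketch. It computes the read N50 and the number of longest reads needed to reach half the total bases. The read lengths are made up for the example, and the function names are mine rather than from any particular tool.

```python
def read_n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no reads supplied")


def reads_holding_half_yield(lengths):
    """Number of longest reads needed to reach half of the total bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count
    raise ValueError("no reads supplied")


# Hypothetical run: many short reads plus a handful of very long ones.
lengths = [500] * 90 + [50_000] * 10

print("reads:", len(lengths))
print("total bases:", sum(lengths))
print("read N50:", read_n50(lengths))
print("reads holding half the yield:", reads_holding_half_yield(lengths))
```

Here 6 of the 100 reads carry half the bases, so a read count alone says very little about how much sequence a run actually produced.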

This last point is the most important. What it means is that, for a read, interpreting the match is simple – you’ll either have a very long alignment to a target, or you won’t. Even when a read has regions of identity to more than one species, it has a much longer cumulative alignment length to the correct species than to any other. This is the main result of our paper.
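
As a rough illustration of the cumulative-alignment-length idea (a sketch only, not the pipeline from the paper), the Python below sums alignment lengths per species for each read from BLAST’s default tabular output (`-outfmt 6`) and reports the species with the largest total. The `species_of` mapping, the `hit` helper and the example hits are all hypothetical, and overlapping alignments are simply added together, which a careful analysis would want to handle.

```python
from collections import defaultdict

# Column order of BLAST's default tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_match_per_read(tabular_lines, species_of):
    """For each read, sum alignment lengths per species; return the top species and total."""
    totals = defaultdict(lambda: defaultdict(int))
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        fields = dict(zip(COLUMNS, line.split("\t")))
        species = species_of(fields["sseqid"])
        totals[fields["qseqid"]][species] += int(fields["length"])
    return {read: max(by_species.items(), key=lambda item: item[1])
            for read, by_species in totals.items()}


def hit(qseqid, sseqid, pident, length):
    """Build one tabular line; columns not used here get placeholder values."""
    return "\t".join([qseqid, sseqid, str(pident), str(length),
                      "0", "0", "1", str(length), "1", str(length), "0.0", "100"])


# Hypothetical hits: one read with long matches to species A and a short one to species B.
hits = [
    hit("read1", "speciesA|contig1", 86.0, 3000),
    hit("read1", "speciesA|contig2", 84.0, 2500),
    hit("read1", "speciesB|contig9", 90.0, 300),
]

# Assumed naming scheme: the species label is the part of sseqid before the '|'.
result = best_match_per_read(hits, lambda sseqid: sseqid.split("|")[0])
print(result)
```

Note that the short hit to species B has the highest percent identity, but species A wins comfortably on cumulative length.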

The second implication is that, as it has been put to me, for nanopore reads to be any good for an ID, you have to have a genomic database. While this is true in the narrow sense, our current work (and again, this is partly in our paper, and partly in preparation) shows that in fact all that matters is for the length distribution in the reference database to be similar to the query nanopore one. In particular, we’ve demonstrated that a rapid nanopore sequencing run, with no assembly, can itself serve as a perfectly good reference for future sample ID. This has important implications for sample ID but as I said, more on that later 😉

Posted in Coding, Science

Only Corbyn can save the Left

Labour go into this election with the dice stacked against Corbyn, as the Tories intended. But here’s the thing – he’s only an election liability *if* you believe he can win.

But in fact everyone – surely even the man himself – can see there is *no chance* of Corbyn being the next PM, even in coalition.

On the 8th June the nation will choose not ‘May or Corbyn’ but ‘big or humungous Tory majority’.

The Lib Dems can’t win either, and again, the whole world knows it. So Brexit is most definitely happening.

Those two facts mean the left can neutralise Tory scaremongering on a ‘PM Corbyn’ or ‘Brexit backsliding’. That’s how we move the conversation on to what kind of country we want. There the Tories are on a much weaker foundation. The fact is, NINE years after the banking crisis, and seven years after they took power, the Tories have cut and wrecked at every opportunity, the longest, most savage swipe at living standards in memory. They will keep on, and on, and on at our pockets because they are ideologically unable to think of anything else.

So if the left can only acknowledge that no, they can’t win this time, no, they can’t stop Brexit, and no, Corbyn won’t be PM, they can turn on to arguments that are winnable. Better yet, a tacit pact to collaborate (perhaps supporting Compass to produce a social-media-friendly ‘tactical voting app’ based on postcodes or similar) would lay the necessary foundations for a proper power grab in 2022 – when the Tories will have been in power for over a decade.

I suspect this would be the Tories’ worst nightmare. May’s gamble would completely backfire – winning the election (narrowly) but losing the national argument.

The key is only Corbyn, or those close to him, can trigger this. Perhaps the Easter significance will inspire them…

Posted in Activism

Step-by-step: Raspberry Pi as a wifi bridge, plus a (really) low-spec media centre…

I’ll keep this brief, really so, because this is mainly an aide-memoire for when this horrific bodge breaks in the next, ooh, month or so. But, for context:

The problem:

Our office/studios are in a shed at the bottom of the garden (~15m). Wifi / wireless LAN just reaches, intermittently.

The solution:

Set up an ethernet network in the shed itself, and connect (‘bridge’) that network to the house wifi with a Raspberry Pi.


You will need:

1x Raspberry Pi (Pi 2 Model B; mine overclocked to ~1150MHz) plus SD card and reader; an old ethernet switch and cables; quite a lot of patience.

A bit more detail:

This step-by-step is going to be a bit arse-about-face, in that the order of the steps you’d actually need from scratch is completely different from the max-frustration, highly circuitous route I actually followed. Not least because I already had a Pi with Ubuntu on:

  1. Get a Pi with Ubuntu on it. This will be acting as the wireless bridge to connect the LAN to the wifi; and also serve IP addresses to other hosts on the LAN (network buffs: yes, I realise this is a crap solution). This is the second-easiest step by a mile; see: this guide for MATE and follow it. We’ll set the Pi up to run without a monitor or keyboard (‘headless’ – connecting over SSH) later, but for now don’t ruin your relationship unduly, do this bit the easy way with a monitor attached.
  3. apt-get update the Pi a few times. You’ll thank yourself later.
  4. Set the Pi up to act as a wifi <–> LAN bridge. There are a lot of tutorials suggesting various ways to achieve this such as this, this, and all of this noise. But ignore them all – with the newest Ubuntu LTS (16.04 at time of writing) this is now far, far, far easier to do in the GUI, and more stable. Just follow this guide.
  5. Set up some other housekeeping tasks for headless login: enable SSH (see also here); set the clock to attempt to update the system time on boot if a time server’s available (make sure to add e.g. server to your /etc/ntp.conf file) and login to the desktop automatically. This last action isn’t necessary, and purists will claim it wastes resources, but this is a Pi 2 and we’re only serving DHCP on it, basically – it can afford that. The reason I’ve enabled this is because it seems to force the WLAN adapter to try to acquire the home wifi a bit more persistently (see below). I’ve tried to achieve the same results using wpa_supplicant, but with no stability, and my time is a pretty finite resource, so screw it – I’m a scientist, not an engineer!
  6. Lastly, I’ve made some fairly heavy-duty edits (not following but at least guided by this and this) to my /etc/network/interfaces file, with a LOT of trial and error which included a couple of false starts bricking my Pi (if that happens to you, reinstall Ubuntu. Sorry.) It now reads (home wifi credentials redacted):
    # interfaces(5) file used by ifup(8) and ifdown(8)
    # Include files from /etc/network/interfaces.d:
    source-directory /etc/network/interfaces.d

    # The loopback network interface
    auto lo
    iface lo inet loopback

    # LOOK at all the crap I tried...
    #allow-hotplug wlan0
    #iface eth0 inet dhcp
    #allow-hotplug wlan0
    #iface wlan0 inet manual
    #iface eth0 inet auto
    #wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf
    # Yep, that lot took a while :\

    # Finally, this worked:
    auto wlan0
    iface wlan0 inet dhcp
    wpa-ssid "q***********"
    wpa-psk "a******"
    # That's it :D
  7. Connect the Pi to your other computers using the switch and miles of dodgy ethernet cabling.
  8. Disconnect the screen, reboot, and wait for a long time – potentially hours – for the Pi to acquire the wifi. You should now be able to a) ping and/or login to the Pi from other hosts on the LAN, and b) ping/access hosts on the home WLAN, and indeed the wider Internet if your WLAN has a connection(!)
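
To save repeating the checks in the last step by hand, something like the following Python sketch could be run from another host on the LAN. It is only a sketch: the addresses in `HOSTS` are placeholders to replace with your own, `ping` assumes the Linux `ping` command (with its `-c` and `-W` options) is installed, and the `probe` argument exists so the logic can be exercised without touching the network.

```python
import subprocess


def ping(host, count=1, timeout_s=5):
    """Return True if `host` answers a ping (assumes the Linux `ping` command)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_hosts(hosts, probe=ping):
    """Map each named host to True/False depending on whether the probe succeeds."""
    return {name: probe(address) for name, address in hosts.items()}


def report(results):
    """Turn the results into printable lines."""
    return [f"{name}: {'reachable' if ok else 'NO REPLY'}"
            for name, ok in results.items()]


# Placeholder addresses: substitute the ones on your own network.
HOSTS = {
    "pi (LAN side)": "10.42.0.1",
    "home router": "192.168.1.1",
    "internet": "8.8.8.8",
}
```

Calling `print("\n".join(report(check_hosts(HOSTS))))` then gives a one-line verdict per hop, which helps show whether a failure is on the Pi, the home WLAN, or the connection beyond it.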

A Media centre from scratch

Last of all, having gone to all that trouble, the glaring bandwidth inadequacies of our crap WLAN showed up. Being stingy by nature (well, and because the phone companies in our area insist that, despite our living less than a day’s march from Westminster, their exchanges have run out of fibre capacity for 21st-century broadband) I decided to mitigate this for the long winter months the simplest way: gather the zillions of mp3s, ripped DVDs and videos from all our devices onto one server. I put Ubuntu (the same 16.04 / MATE distribution as on the Pi, in fact) onto an old Z77 motherboard my little brother had no use for, in an ancient (~2003) ATX case, with a rock-bottom new Celeron CPU (~£25) plus 4GB SDRAM and a cheap spinning drive I had lying about (a 2TB Toshiba SATA, IIRC). This is highly bodgy. So much so, in fact, that the CPU fan is cable-tied onto the mobo, because the holes for the anchor pins didn’t line up. But: it works, and only has to decode/serve MP3s and videos, after all.

I apt-get updated that a few times, plus adding in some extra packages like htop, openssh, and hardinfo – plus removing crap like games and office stuff – to make it run about as leanly as possible. Then, to manage and serve media I installed something I’d wanted to play with for a while: Kodi. This is both a media manager/player (like iTunes, VLC, or Windows Media Player) and also streaming media server, so other hosts on my LAN can access the library by streaming over the ethernet if they want without replicating files.

Setting up Kodi was simplicity itself, as was adding movies and music to the library, but one minor glitch I encountered was reading/displaying track info and artwork, which usually happens on iTunes pretty seamlessly via ID3 tags, fingerprinting, and/or Gracenote CDDB querying. Turns out I’d been spoilt this last decade, because in Kodi this doesn’t happen so magically. Instead, you need to use a tool like MusicBrainz Picard to add the tags to MP3s on your system, then re-scan them into Kodi for the metadata to be visible. The re-scanning bit isn’t as onerous as you’d think – files are left in place, the ID3 tags being used simply to update Kodi’s metadata server (I guess) – but the initial Picard search for thousands of MP3s over a slow WLAN took me most of a night.

However. A small price to pay to actually have music to listen to while I work away writing crap like this in the shed, or shoddy-quality old episodes of Blackadder or Futurama to watch in the evening :p

Posted in Blog, Coding

‘Stretched resources’ applies to parliamentarians, too…

I listened to a sad story on BBC Radio 4 this morning – a grieving mother can’t bury her murdered daughter because there’s no body – in years of searching none has been found. The killer isn’t co-operating (they don’t have to, although it would improve their parole terms to do so). She wants a change in the law so murderers can’t get parole until a body is produced (habeas corpus, literally).

This is a worthy campaign, and it must blight her life. Thing is, this scenario affects ‘70 whole families’, by her own numbers. Just 70 in the whole of the UK. A change in the *law* for this? A law which has to go through parliamentary scrutiny (twice), occupying time and resources.

Couldn’t sentencing just be updated, instead?

We have a vast number of small, individually important laws, but are they collectively eating away at the vitality of our democracy? MPs need to be wrestling with and thoroughly, openly, debating the massive challenges of our time – automation, climate change, ageing, food security, migration. Most complain of long hours. Not every minor cause is lucky enough to have an effective MP to champion it, either – which ‘good ideas’ make it into law is arbitrary, in this sense. And finally: should we have hundreds of such minor bills on the books?

Or a simpler legal code, with more judges, able to devote more time to judicious sentencing, and a fast effective appeals process for victims and the convicted if they feel sentences and parole are unjust?

Posted in Activism, Blog

Using field-based DNA sequencing to accelerate phylogenomics

Invited seminar at the Department of Zoology, Oxford University, 30th November 2016.

Summary of our field-based real-time phylogenomics (MinION DNA sequencing) experiments this year, and applicability to broad-scale tree-of-life phylogenomics and macroevolutionary biology.

Slides [SlideShare]: cc-by-nd

Posted in Publications, Talks

Science and (small) business

Over the last 10-20 years there’s been a revolution in academic science (or should that be ‘coup’?) where many aspects of the job have been professionalised and formalised, especially project management but management in general. This generally includes tools like Gantt charts, milestones, workload models, targets and many other things previously unmentionable in academia but common in industry, especially large organisations. Lots of academics will tell you they think it’s bureaucratic overkill, intrusive, a waste of time, and worse (to put it mildly) but the awkward truth is that, as lab groups steadily increased in size (as fewer, larger grants went to increasingly senior PIs or consortia) many of the limitations of the collegiate style of the past, centred on a single academic with a tight-knit group, have been exposed.

Frequently the introduction of ‘management practices’, often after hiring expensive consultants, is accompanied by compulsory management training. Sometimes it can be an improvement. More normally (in my experience) whether an improvement in outcomes (as distinct from ‘efficiency’) has been achieved probably depends on whether you cost in staff time (or overtime) and morale. You can make arguments either way.

But I can’t help thinking: why are we attempting to replicate practices from big/massive private sector organisations, anyway? I suspect, the answer in part is because those are the clients management consultants have the most experience working with. More seriously, those organisations differ in fundamental respects from even the largest universities, let alone individual research projects. This is because large companies:

  • Add value to inputs to create physical goods or services that are easily costed by the market mechanism (this is the big one)
  • Usually have large cash reserves, or easy access to finance (tellingly when this ends they usually get liquidated)
  • Keep an eye on longer-term outcomes, but primarily focus on the 5-10 year horizon
  • Compete directly with others for customers (in some respects an infinite resource)
  • Are answerable, at least yearly, to shareholders – with share value and dividends being the primary drivers of shareholder satisfaction.

Meanwhile, universities (and to an even greater extreme, research groups/PIs):

  • Produce knowledge outputs with zero market value*
  • Live hand-to-mouth on short grants
  • Need long-term, strategic thinking to succeed (really, this is why we get paid at all)
  • Compete indirectly for finite resources (grants and publications), based partly on track record and partly on potential
  • Answer, ultimately, to posterity, their families, and their Head Of Department

I want to be clear here – I’m not saying, by any means, that previous management techniques (i.e. ‘none’) work well in today’s science environment – but I do think we should probably look to other models than, say, General Motors or GlaxoSmithKline. The problem is often compounded because PIs have no business experience (certainly not in startups) while consultants often come from big business – their ability to meet in the middle is limited.

Instead, small and medium enterprises (SMEs) are a much closer model to how science works. Here good management of resources and people is extremely important, but the scale is much smaller, permitting different management methods, often focussing on flexibility and results, not hierarchies and systems. For instance, project goals are often still designed to be SMART (specific, measurable, achievable, realistic, time-scaled) but these will be revisited often and informally, and adjusted whenever necessary. Failure is a recognised part of the ongoing process. This is the exact opposite of how a Gantt chart, say, is used in academia: often drawn up at the project proposal (design) stage, it is then ignored until the end of the grant, when the PI scrabbles to fudge the outcomes, goals, or both to make the work actually carried out fit, so they don’t get a black mark from the funder/HoD for missing targets.

There are plenty of other models, and they vary by more than just organisation size/type (e.g. tech startup, games studio, SkunkWorks, logistics distributor, niche construction subcontractor), but you see what I mean: copying ‘big business’ wholesale without looking at the whole ecosystem of business practices makes little sense…

*Obviously it isn’t true that all, or even most, scientific output will never realise any economic value – but that value can be years, or centuries, removed from the work that created it. And spin-outs are relevant to a tiny proportion of PIs’ work, even in applied fields.

Posted in Activism, Blog, Science

Making progress down the road

Too many laws and customs of driving make speed more important than safety, from the driving instructors’ “make good progress down the road” (i.e. “hurry the fuck up”, which most drivers internalise as “drive at least as fast as the speed limit unless there’s literally another car right in front of you”), to every transport investment ever being marketed to (presumably furious) taxpayers as “reducing journey times”.

This is in contrast to other European countries, where safety is #1, and speed just a nice-to-have. Surely it’s time for the national Government to admit – as London’s TfL have – that the UK is blessed with only a fixed amount of road space, so with growing numbers of people using it, we all have to accept that journeys will get slower in future, not quicker.

We have a real blind spot (pun intended) in the UK about traffic jams. On the one hand, we are only too aware of all the time we **WASTE** sat in stationary traffic each day – most car journeys are shorter than five miles, made by commuters, and spend up to half their duration in queues – so traffic jams are a fact of driving life here in the UK.

On the other hand, people’s frustration / anger / surprise about being stuck in a traffic jam on any given morning (when they are, every morning) is total. But this is bizarre… We know the traffic will be there, but still get in our car expecting a free road, at 08:30 on a weekday! Where’s all that traffic come from?!

Surely it’s time to admit traffic jams exist, will get worse, not better, and constantly lurching from 0 to 30mph and back again is pointless as well as dangerous?

Imagine a world where the DoT’s published targets and main priority were to reduce accidents per mile travelled, and included walking and cycling targets, not journey times? Where 20mph became the standard urban default speed limit, not exception? Where satnavs routinely pointed out to users when (given traffic conditions) particular journeys, short and long, were quicker by public transport / foot / bike?

A safer UK. A calmer UK. And – just possibly – a healthier, richer, and happier UK.


Posted in Activism, Blog, Cycling

Single-molecule real-time (SMRT) Nanopore sequencing for Plant Pathology applications

A short presentation to the British Society for Plant Pathology’s ‘Grand Challenges in Plant Pathology’ workshop on the uses of real-time DNA/RNA sequencing technology for plant health applications.

Doctoral Training Centre, University of Oxford, 14th September 2016.

Slides [SlideShare]: cc-by-nc-nd

Posted in Publications, Science, Talks