Conference season thoughts #3 – Hosting a conference

This is part 3 in a series of posts this summer conference season. It isn’t aimed at one particular LOC – I know how hard they all work – but intended as a general reflection.

I’ve been to god-knows how many academic conferences in the last 15-20 years, but I can only really remember a handful (partly due to the post-conference booze, but…) Some stand out for great reasons, others for terrible ones. So I’ve put together a few thoughts I keep in mind when organising a meeting:

Remember your purpose

Some conferences are run as tightly-knit events, where a small, well-integrated field essentially performs business-type duties of exchanging information and doing deals on collaborations, etc. Others function more like a massive science-bazaar, with a really wide range of talks. Some are focused on an exciting emerging theme, or raise awareness; others are an opportunity to laud an eminent scientist retiring, or to provide students and early-career researchers with an exciting platform. Whatever your purpose, make sure you’re agreed on it, as a committee. Keep your purpose in mind throughout the planning, delivery, and debrief after the conference to make sure you stay focused. For instance, employing a grad student to live-blog the conference (see below) is fairly pointless if literally every scientist in that field is there in the room…

Take the logistics seriously

Your committee is likely to be based in the city/campus where the conference is to be held. This gives you advantages when it comes to organising the event logistics (travel, accommodation, food) but unfortunately, it also means you’re liable to miss some elements which will be vital for others:

  • How will people get to the conference? Is there public transport? Bike parking? Are people likely to try and drive there? What about pedestrians? And…
  • Will any of your attendees have particular physical needs around accessibility (spoiler: almost certainly)?
  • Will the food be acceptable to diverse cultures and diets? Priced sensibly for grad students, including those from developing nations?
  • Similarly, the accommodation should cater for a wide range of attendees’ needs and budgets.
  • Will the conference pricing and customer support work well? Can it cope with loads at registration/payment deadlines? Who will provide ‘customer’ support to colleagues with queries, and how? Are you collating an FAQ as you go?
  • How will abstracts be collected and published?
  • What will your social media policy be (see below)?
  • Finally: brief all your staff and volunteers thoroughly!!! There’s nothing worse than seeing an enthusiastic student put on the spot because (for instance) shirty delegates can’t log in to the wifi, or realise that the conference programme has been amended from the printed one.

Local entertainment and culture

Many conferences seek to showcase the local culture and/or nightlife. This can be a fantastic addition to the conference, but make sure you plan it ahead, and be inclusive. If drinking loads of alcohol is a big part of the activities, for instance, make sure teetotal attendees are catered for.

Also, while the temptation is for the prime-movers in the conference committee or society to use the conference as an opportunity to socialise amongst themselves, this can rapidly become cliquey and exclude others. Make sure you circulate and socialise with your delegates (some of whom may have travelled a long way, perhaps to meet you) and even consider specific icebreaking activities.

Delegate safety

This is key. You do have a duty of care over your attendees. If your conference is in an inner city and there are events planned after dark, maybe don’t get your delegates pissed, give them all bright orange rucksacks labelled ‘mug me’ and send them off into the night…

Social media

Social media and liveblogging are really brilliant additions to the conference experience these days. They give an opportunity for colleagues to meet like-minded scientists; for the more shy to express their opinions in a safe space online; to catch up on parallel sessions; for those unable to attend to follow along; to advertise your next event; and simply to disseminate information such as programme changes. However, you must think about a few things:

  • Will you aim to keep a record or archive online of the conference? If so are you hoping this will happen naturally? Or can you ask a student or ECR to take responsibility?
  • Will you have a twitter account and hashtag for the conference? If so, make sure the hashtag isn’t taken(!)
  • Make sure the conference house rules for social media are understood in advance e.g. can slides, videos etc be shared or not?

Scientific content and chairs

Ultimately the science is why we’re here, right? So make sure you pick willing and competent chairs for your sessions. These should be serious-minded, but sociable people who are going to take their responsibility seriously and make sessions flow as smoothly, interestingly, and inclusive-ly (sorry) as possible. Don’t pick chairs just because you can call them easily, or because they are bigwigs in the field. Try really, really hard to maximise diversity amongst your chairs (so delegates feel welcome), and don’t forget that even relatively junior early-career researchers might make fantastic chairs, depending on their experience. They’ll appreciate the opportunity far more than some introverted old misanthrope who doesn’t even want the job – and make much more of it.

As well as the chairs, the opening and closing remarks should be an opportunity to set the science and tone for the debate. You’ll wish to talk about scientific issues, of course – but the main point of opening and closing remarks is to kick a conference off in the right spirit and summarise highlights (this extends to individual sessions – and you may want to have opening/closing remarks each day, too, for long programmes).

Above all, make sure your chairs all understand that you are trying to foster a lively, open, and inclusive debate throughout the conference – not a narrow, cliquey experience, or a toxic, personalised, hyper-macho slagging match.

Thank everyone

Above all, thank everyone – from the attendees, students, AV techs and soundmen up to your speakers, chairs, and committee!

Posted in Science

Conference thoughts #2: Chairs, earn your ticket!

There are good chairs and bad. Surprisingly (or not, this is academia, after all..) there’s little guidance. I’ve put together a list:

  1. Start as you mean to go on. Remind speakers to keep to time (see next point) and encourage questions from all attendees (see point 3)
  2. Control time with an iron fist – and let speakers know in advance that you plan to. Letting the big old dogs in your field ramble on at the expense of others’ time is unfair to younger researchers. Introducing yourself to speakers early by email is a great place to mention this and it also kicks the relationship off well
  3. Do you want to create a lively, inclusive spirit of debate? If so, lead the way –
  4. Do your homework. Simply reading the speaker’s title and name from their slide is disrespectful (how would you feel?)
  5. Take responsibility for the sound. You are probably nearest the speaker stage, so if the sound is muffled, or inaudible to you, then others will be really struggling. If a mic has been turned off, or not clipped on, step in, apologise, and rectify it.
  6. Pay attention during every talk, not just the ones your mate’s giving. This is because you need to…
  7. … make sure you have a reserve question for every speaker, in case there aren’t any. This is especially true for students, who will have put in a lot of work and stress to give their first international presentation.
  8. You are the chair. Police the questioning accordingly. If the same people are crowding other, junior, less confident questioners out, simply pass over them or gently but firmly make sure they wrap up. Make sure all questions can be heard, clearly, by the audience.
  9. Stay impartial. It may be more useful to imagine yourself an editor, not contributor, to the debate.
  10. If a debate turns toxic, cut that shit out.
  11. Finish by thanking all speakers and the audience.
Posted in Science

Conference thoughts #1: PIs, support your students!

Summer conference season is nearly over. This is the first of three posts, informed by some reflections about the nature of scientific conferences.

Students often feel under a lot of pressure when giving their first public presentations. PIs, for whom a conference is as much a chance to strike deals and go drinking with old mates from grad school, usually don’t. This is a mistake – who realistically expects the first-year grad student (who’s probably never been to an international scientific conference) to smash it and give a fantastic, polished, relevant, clear, concise talk?

Getting it wrong

Students, who may well only have experience of lectures on taught courses and lab meetings, usually approach their first conference talks in one of two ways: Firstly, the ‘incredibly nervous and over-written’ model. In this version, they burn midnight oil for hours on end, cramming everything but the kitchen sink into a 12 minute presentation, and writing a complete script running to many pages. In the second approach, the student (who is probably fairly bright, confident, and may have watched too many TED talks) is hopelessly underprepared, with perhaps a handful of raw graphs and only a vague grip on the facts of their experiment.

Needless to say: both students come easily and embarrassingly unstuck in the glare of a 200-seater-plus auditorium of professors…

Getting it right

Now, everyone reading this with even a couple of conferences under their belt will have recognised these two scenarios. You’re probably thinking ‘so what? Easy to fix – and there are dozens of help resources online. Plus – everyone has to dive in at the deep end: I did!’

My point is that, yes, these are easy to remedy – and it’s the PI’s job to do this. What does it say about your mentorship, if your students – who may have an interesting result, and for whom their first public talk marks a seminal coming-of-academic-age – can’t shine, not because of a fault of theirs, but simply because you haven’t imparted a clear idea of what is expected, and how to deliver it?

Whenever I see a young grad student floundering, I immediately look for the supervisor. They’re rarely even in the room, and I think this is telling. Supervisors – your students are not free labour! You have a deep responsibility to them as colleagues. So I would like to suggest the following:

  • Be clear. Make sure the students know, early on, what presenting entails; who the audience will be; and what standard is expected. Get them to attend seminars, not for the scientific content but to observe how the talk is presented.
  • Be hands-on: The point of a PhD/DPhil/DEng is to build independence, sure – but there’s a difference between a student’s first paddle in the academic ocean, and throwing them in. For this first talk, be their water wings: make sure you iterate a couple of versions of their talk, long before the conference, so you can give timely, accurate feedback.
  • Be rigorous: Above all, you have a responsibility for your trainee’s content. In the heat of the moment they may slip up, or mis-speak. But never, never, never let your student present things which are flat-out factually wrong. At a recent presentation I watched in horror as a student (presenting some barplots with N=3) repeatedly described their results as ‘significant’. I had to resist the urge to drag their PI out of the hall then and there, because this student had obviously never had any feedback from their PI (not to mention training in stats – but that’s for another post), or worse still, the PI didn’t even notice the error…
  • Be supportive: A student’s first presentation is a major milestone in both their academic development – and yours. Make sure you’re there in good time to wish them luck beforehand, and stay to the end!
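To put a number on the N=3 point above, here is a minimal sketch of an exact permutation test. It assumes the barplots compared two groups of three measurements each (the talk's actual design isn't known), and the data are made up to give the cleanest possible separation between groups. Even then, the two-sided p-value can't drop below 0.1, because there are only 20 ways to split six observations into two groups of three.

```python
from itertools import combinations


def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    # Enumerate every way of relabelling which observations belong to group A.
    splits = list(combinations(range(len(pooled)), n_a))
    extreme = sum(
        1
        for idx in splits
        # Small tolerance so float rounding can't exclude the observed split itself.
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12
    )
    return extreme / len(splits)


# Hypothetical measurements: every 'treated' value exceeds every 'control' value.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
p_value = exact_permutation_p(control, treated)
print(p_value)
```

A parametric test such as a t-test can return a smaller p-value on the same data, but only by leaning on distributional assumptions that three points per group can't really check, which is exactly the sort of thing a supervisor should be probing before the talk.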

In conclusion

I’ve a feeling poor graduate student instruction stems from a series of equally rubbish experiences we all had many decades ago. It was sub-par then, and it’s sub-par in 2019. Support your students, give them a great learning experience, and you’ll reap the rewards!

Posted in Science

Why Trump is right on defence spending… and wrong

It’s all over the press this week: Trump wants all NATO members to spend 4% of their GDP on defence – a doubling of the previously-agreed 2% target (which many countries don’t meet). To me this seems pretty wrong, for two reasons, but not the ones you might imagine. These aren’t new arguments, but since there’s a grain of truth in Trump’s criticism (NATO, for all the money spent on it, has demonstrably failed to deter aggression on its borders), let’s unpack this.

Firstly, it is entirely legitimate to ask whether there is a level of defence spending that ceases to be sustainable in any society. To think about this we need to stop thinking of national budgets like ‘household expenses’ (pub, Sky subscription, holidays), and think of national spending more as a business-type investment. This is because most governmental spending is either meeting our national ‘running costs’ (recycling rubbish, paying the police to keep law and order), or a genuine investment (education improves skills and so wellbeing and productivity; health spending keeps more workers active longer; new railways get people to work faster and more reliably, etc.)

Defence spending is different, since it is essentially a form of insurance, i.e. an expense we incur not because we want to use tanks, ships, bombers on a daily basis, but because we hope that by having them, we can (let’s be honest) intimidate other nations into helping meet our foreign policy goals (or at least, not blocking them). Deterrent nuclear weapon systems are the ultimate example of this since – unlike an amphibious carrier, say, or a transport aircraft – they can’t be used for humanitarian work, or even counter-terror operations.

So is 4% of GDP sustainable for all NATO countries? Arguably not, or at least, not equally. NATO is now a large organisation, and includes:

  • small post-communist states like Bulgaria and Slovakia;
  • larger mid-table nations such as Spain and the Netherlands;
  • G7 members like the UK, France and Germany;
  • the USA, a global hyperpower with self-imposed military commitments in every part of the world, currently engaged in the start of a (hopefully slow) relative decline in power compared to China; and outliers like
  • Turkey (a massive country with one of the lowest per-capita incomes in Europe, but the only NATO state with a shooting war within its own borders, at least intermittently, in Kurdistan)
  • and Canada (a geographically sprawling, rich nation with – let’s be honest – very few real direct foreign-policy competitors).

Even if we set aside the very differing nature of each country’s political economy (Germany’s economy continues to motor along with export prices kept low by the Euro; Bulgaria is in dire need of investment in all sectors; the UK has stagnated for a decade with chronic underproductivity due to expensive housing stock and transport costs; the US continues to benefit from the dollar’s status as a global currency), there are clearly very different foreign-policy goals and motivations here.

Canada, for instance, could probably disarm their offensive military completely, retaining only a very small gendarmerie / expeditionary force – safe in the knowledge that the odds of an unprovoked US annexation occurring any time soon are minimal (any ratcheting of tension between the two neighbours would likely take decades to result in all-out war). On the other hand, the Baltic states not only have fresh memories of Soviet rule 28 years ago – many still remember the Nazi invasion of 1941, while the Great War and following Russian Revolution are also barely out of living memory. They are also effectively encircled by an increasingly hostile Russian autocracy, have no large territory into which they can withdraw to defend in depth, and have tiny populations – the total population of Latvia, Estonia and Lithuania, 6.1 million people, is barely half that of Moscow alone (11.8m). For NATO, successfully defending any of the Baltic states in the face of even a small Russian attack is probably impossible (Paul Mason has written well about this). I imagine the Estonian GHQ have many more sleepless nights than their Canadian counterparts.


There’s a second line of reasoning to Trump’s policy though, and it goes like this: whatever the defence spending target is, we’ve all agreed to meet it, and lots of you aren’t ‘paying your share’. On the face of it, this is more persuasive: if NATO is a club, shouldn’t all members pay the subscription fee?

Here we have to realise that not every dollar spent on defence by every country is created equal. Different countries have very different defence objectives, and economies of scale really matter. Military kit is really expensive – let’s take a look at some well-established costs for 3rd- and 4th-generation hardware. Remember, this isn’t the newest, shiniest, most-expensive-est stuff like drones, cyber capabilities, or energy weapons: just common-or-garden NATO workhorse kit of the sort that’s been in service for over three decades:

  • US F-16 fighter: $14,000,000 – $18,600,000
  • Franco-Italian FREMM multipurpose frigate: ~$500,000,000
  • Tomahawk cruise missile: $250,000 per munition
  • German Puma infantry fighting vehicle: €12,000,000
  • AH-64D Apache attack helicopter: $33,000,000

The 2% of GDP would mean, for say Bulgaria or Lithuania, a total kitty of about $1000m per year. Just one frigate (or a squadron of fighters) would eat up half of that sum in capital expenditure alone, before we even think about running costs! The key thing to understand here is that a modern military works as a set of integrated parts. It doesn’t matter how great your tanks are; without air cover you’re boned, immediately, if your enemy has anything approaching a decent ground-attack capability. Then again, airfields in turn need protection from bombing, while all militaries have a very long tail of logistics to keep soldiers fed (and bandaged), trucks fueled, tanks maintained in working order, and so on. Generally speaking, managing to mount one soldier in frontline operations for every 3-4 in support/logistics/training is considered really good*.
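As a rough sanity check on that arithmetic, here is a minimal sketch using the prices quoted in the list above and the ~$1000m kitty. The squadron size of 24 aircraft is purely an assumption for illustration (the post doesn't say how many aircraft make up a squadron), and the F-16 price used is the upper end of the quoted range.

```python
# Prices quoted in the list above (USD); the F-16 figure is the upper bound.
FRIGATE_USD = 500_000_000
F16_USD = 18_600_000

# The post's rough figure for 2% of GDP for Bulgaria or Lithuania.
ANNUAL_BUDGET_USD = 1_000_000_000

# Assumption for illustration only: squadron size is not given in the post.
SQUADRON_SIZE = 24


def share_of_budget(cost_usd, budget_usd=ANNUAL_BUDGET_USD):
    """Fraction of one year's defence budget eaten by a single purchase."""
    return cost_usd / budget_usd


frigate_share = share_of_budget(FRIGATE_USD)
squadron_share = share_of_budget(SQUADRON_SIZE * F16_USD)

print(f"One frigate: {frigate_share:.0%} of the annual budget")
print(f"{SQUADRON_SIZE} F-16s: {squadron_share:.0%} of the annual budget")
```

Under these assumptions the frigate takes exactly half the year's budget and the squadron a little under half, and that is capital cost only, with nothing left over for crews, fuel, spares or training.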

The current furore about scrapping some of our existing amphibious (i.e. beach landing) capability in order to pay for the two Queen Elizabeth-class carriers, their aircraft, and associated carrier battle groups highlights the stark choices even a large power like Britain faces: do everything on a shoestring, or focus on some capabilities (special forces; cyber; expeditionary warfare) at the expense of others (a nuclear deterrence force; large land operations).

Smaller powers such as Belgium or Poland feel this tension even more keenly, but since each country’s citizens (and generals) want to feel their military can protect them on its own, without needing others’ help, we find that the European militaries consist of a succession of small-unit forces, e.g. brigades and regiments, not division-scale forces – each with its own tankers, training institutions, intelligence corps and headquarters. This is incredibly wasteful, but predictable since the illusion of national independence (as opposed to interdependence) relies on the presumed ability to act alone if need be.

There is a way to address some of this redundancy of course: have each nation contribute select components of a modern military capability, rather than trying to cover all bases. In extremis, this could mean a completely integrated NATO army where the Germans and Poles provided the massed land formations; the UK and French the surface fleet; the Hungarians the tankers, and so on – but while there is a long history of attempting to do this (the famed “NATO 7.62mm / 5.56mm” small-arms calibres, for instance, or the Franco-German Brigade) the reality is the people of NATO are, at present, more attached to national glory than military effectiveness – who wants to parade a squadron of petrol tankers on Armistice Day, anyway? Plus, this is just the sort of sovereignty-pooling Euro-nonsense many on the right recoil at – let alone Trump and his neo-isolationist MAGA crowd, happily powerful enough that this sort of impossible choice never confronts them.

In other words, throwing more money at the problem – and harming the long-term economic health which underpins any power-projection by so doing – without first reforming NATO so the dollars/euros we commit are spent effectively, without duplication and waste – is economic and military insanity. But I suspect we’ll muddle on – until, or unless – an external action forces NATO to wake up and smell the kvass.


*For instance, the UK armed forces managed to deploy forces in Iraq and Afghanistan for most of the last 15 years which peaked at around 10,000 in Afghanistan, and 46,000 in Iraq (the invasion itself). Sustaining this level of commitment (~30,000-50,000 troops, of a total UK Armed Forces headcount of around 250,000 personnel – including 80,000 reservists – i.e. about 1:4 deployed:support troops) over this length of time stretched the UK’s capability almost to breaking point. And the UK / France force size is probably about the smallest it’s possible for a modern military to be while still retaining the ability to conduct independent operations (i.e. without the Yanks) anywhere in the world at short notice (Libya arguably showed even this premise to be shaky – we relied on some US support, and acted in concert with France).
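The ratio in this footnote can be worked through with a minimal sketch. It uses only the headcount and deployment figures quoted above, and assumes (crudely) that everyone not deployed counts as support.

```python
def support_per_deployed(total_headcount, deployed):
    """Number of non-deployed personnel for each deployed soldier."""
    return (total_headcount - deployed) / deployed


TOTAL_HEADCOUNT = 250_000  # figure quoted above, including reservists

ratios = {
    deployed: support_per_deployed(TOTAL_HEADCOUNT, deployed)
    for deployed in (30_000, 50_000)
}

for deployed, ratio in ratios.items():
    print(f"{deployed:,} deployed -> 1:{ratio:.1f} deployed:support")
```

On these numbers the 1:4 figure corresponds to the upper end of the range (50,000 deployed); at 30,000 deployed the ratio is closer to 1:7.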

Posted in Blog

I’m a programmer – and driverless cars scare the hell out of me

Tech-savvy developer types often ride bikes, and often instinctively back the idea of robot vehicles. After all, the subconscious asks, if I can code a computer to play a game, what’s so hard about getting it to move around a real map?

But driverless cars are not like ‘normal’ AI. They exist in a world with all-too real consequences. For the first time we’re asking billions of people to cede control of a lethal machine to a still-highly-experimental, incredibly complicated, autonomous system. Noise, edge cases and unusual behaviours not present in training data all abound. Autonomous vehicles will be more dangerous for cyclists and pedestrians, not less.

Bikes and people can suddenly reverse, or move sideways, or diagonally. They can even jump! This means that dealing truly safely with these types of users – call it the ‘Friday Night In Peckham Test’ – is a much bigger challenge than a motorway, where three lanes of users move in predictable patterns, and all between 40-80MPH.

There’s a reason car manufacturers are testing away from towns, and won’t share testing data, or test specifications. They’d rather not have to deal safely with pedestrians and bikes because they know how much more costly they are. So car-makers would rather they went away. They’ll probably get their wish because they’re marquee ‘heavy industry’ employers, and because AI belongs to the sleek shiny optimistic future of tech that all politicians are desperate to court.

Instead we will see two worrying developments to shift responsibility from car makers. There will be pressure to gradually remove troublesome bikes and pedestrians from parts of the roads ‘for their own safety’. We’ve already started to see this.

Alternatively, manufacturers will introduce two levels of AI – a truly autonomous ‘safe mode’ which overreacts to all stimuli with infuriatingly (unworkably) slow speeds, and a ‘sport setting’ which makes much riskier decisions to enable faster speeds, under the flimsy caveat that users ‘monitor the system more actively’. Most will prefer to travel faster but few will be bothered to supervise the AI closely or consistently, when Netflix and social media are available distractions.

Finally, a growing body of evidence shows that too much automation of routine tasks can make them more dangerous, not less. Airline pilots’ skills atrophy when autopilot handles most of their flying time – with literally lethal consequences when an emergency occurs. Car drivers – most of whom already take driving far less seriously than such a dangerous activity merits – will suffer the same fate. Who would bet on a driver, suddenly handed control back in an emergency situation, taking the right decision in a split-second if they have perhaps hardly driven for years?

We’d just started to halt the decades long declines in cycling and walking rates, and lower urban speeds to 20MPH (‘twenty’s plenty’ initiatives), but the rise of autonomous vehicles will likely threaten this and see walking and cycling people relegated to third place, behind fleets of AI cars travelling bumper to bumper at far higher speeds than today and a few die-hard petrolheads defiantly navigating their own vintage tanks among the whizzing fleet.

One of the best things about my teenage years was romping around town on foot or bikes with my mates. My daughter turns 16 in 2030. It’s terrifying to think that by then, the killer robots that end her life might not be gun-toting drones, but plain old delivery vans.

Posted in Activism, Blog, Coding, Cycling

Rapid, raw-read reference and identification (R4IDs): A flexible platform for rapid generic species ID using long-read sequencing technology.

Our paper on rapid identification of samples using partial, low-coverage, MinION-sequenced reference databases for ID (at the Kew Science Festival) is in preprint. See here on bioRxiv: doi: 10.1101/281048.

In it, we show (with empirical data and simulation) that the length and bias of MinION reads make them ideal for sample ID – better than NGS, under certain conditions – even where no reference assembly is available and a genomic skim is used to BLAST against, instead. Because these genome-skim-DBs are quick to generate, we call them ‘rapid raw-read reference for ID’, or ‘R4IDs’ for short.

The code to repeat these analyses or set up a R4IDs analysis is on GitHub but I’ve also packaged this as Docker containers:

A few rough corners to sand while we decide where to submit it, but comments welcome in the meantime. Thanks, as ever, to my colleagues (long-suffering Alex and Andrew) for all their stress:

Rapid, raw-read reference and identification (R4IDs): A flexible platform for rapid generic species ID using long-read sequencing technology.

Joe Parker*, Andrew J. Helmstetter, Alexander S. T. Papadopulos*


The versatility of the current DNA sequencing platforms and the development of portable, nanopore sequencers means that it has never been easier to collect genetic data for unknown sample ID. DNA barcoding and meta-barcoding have become increasingly popular and barcode databases continue to grow at an impressive rate. However, the number of canonical genome assemblies (reference or draft) that are publicly available is relatively tiny, hindering the more widespread use of genome scale DNA sequencing technology for accurate species identification and discovery. Here, we show that rapid raw-read reference datasets, or R4IDs for short, generated in a matter of hours on the Oxford Nanopore MinION, can bridge this gap and accelerate the generation of useable reference sequence data. By exploiting the long read length of this technology, shotgun genomic sequencing of a small portion of an organism’s genome can act as a suitable reference database despite the low sequencing coverage. These R4IDs can then be used for accurate species identification with minimal amounts of re-sequencing effort (<1000s of reads). We demonstrated the capabilities of this approach with six vascular plant species for which we created R4IDs in the laboratory and then re-sequenced, live at the Kew Science Festival 2016. We further validated our method using simulations to determine the broader applicability of the approach. Our data analysis pipeline has been made available as a Dockerised workflow for simple, scalable deployment for a range of uses.

Posted in Journals, Manuscripts in preparation, Publications, Science

Field-based, real-time metagenomics and phylogenomics for responsive pathogen detection: lessons from nanopore analyses of Acute Oak Decline (AOD) sites in the UK.

Talk presented at the UK-India Joint Bioinformatics Workshop, Pirbright Institute, 09 Feb 2018


In a globalised world of increasing trade, novel threats to animal and plant health, as well as human diseases, can cross political and geographical borders spontaneously and rapidly. One such example is the rise of Acute Oak Decline (AOD) in the UK, a multifactorial decline syndrome with uncertain aetiology, vectors, and host risk factors first reported in the UK a decade ago. Affected oaks display significant morbidity and mortality, with symptoms including vascular interruption, crown loss and characteristic striking bark lesions breaching cambium and filled with a viscous, aromatic, dark-brown/black exudate, which may sometimes be released under considerable pressure. Although multiple bacterial species have been associated with lesion sites in affected oaks, and a putative insect vector identified, the basic risk factors, transmission, progression and treatment of the syndrome remain unclear.

This dispiriting state of affairs presents an ideal opportunity to exploit recent developments in nanopore sequencing to develop and test field-based methods of real-time phylogenomics and metagenomics to establish baseline data for healthy oaks, and contrast these with affected / dying oaks to shed light on syndrome causes and management. WGS metagenomic sampling was carried out on leaf and bark tissue from 37 affected, asymptomatic, and recovering individuals (nine Quercus species) at three field sites over a year. Extraction and DNA sequencing were performed in the field for a subset of samples with MinION nanopore rapid sequencing kits, and also using MinION and paired-end Illumina sequencing under laboratory conditions. Metagenomic analyses to determine microbial community composition were carried out, and real-time phylogenomic methods were also developed and applied. Early results from these analyses and lessons for future work are presented.

Metagenomic datasets can be rapidly generated in the field with minimal equipment using nanopore sequencing, providing a responsive capability for emerging disease threats and reducing transmission risks associated with transporting quantities of potentially infectious samples from outbreaks of novel diseases. Furthermore, real-time data analysis can provide rapid feedback to field teams, both to inform management decisions and also to allow for adaptive experimental protocols that dynamically target data collection to extract maximum information per unit effort.

Posted in Publications, Science, Talks

Real-time phylogenomics or ‘Some interesting problems in genomic big data’

Talk given at a technology/informatics company, London, Feb 2018.

An overview of contemporary advances and remaining problems in big-data biology, especially phylogenomics.

Posted in Publications, Science, Talks

Read all about it!

Dead excited to say our Scientific Reports paper on field-based DNA extraction, sequencing (and a bit of analysis) has been picked up by the BBC World Service and The Times (UK) newspaper! You can read all about it here (paywall).

If you can’t read it online, my Grandad has a copy he might lend you. We’re proper scientists now…

Posted in Publications, Science

Tent-seq: the paper (aka ‘field-based, real-time phylogenomics’)

Really proud to report that the first of our bona fide real-time phylogenomics papers is now out in Scientific Reports!

In the paper we managed to show a number of things that are potentially really exciting, and I’ll get to them in a minute. First though, this is the first paper I’ve published where I got to drive every part: from conceiving the idea (with Alex) to getting funding, planning and carrying out the fieldwork/field-sequencing (with Alex and Dion), sequencing (all by Andrew) and analysis and writing (everyone). This was incredibly satisfying as normally a lot of my time is spent analysing downstream data. I feel like a proper grown-up scientist now. More please!

Firstly, what did we actually do? Pretty simple really:

  • Over a week (May 2016) in Snowdonia National Park, Wales,
  • We collected the flower Arabidopsis thaliana and congeneric A. lyrata,
  • Extracted their DNA and prepared sequencing libraries for MinION sequencing, in a tent with no mains power or running water,
  • Sequenced both species using Oxford Nanopore MinION, and
  • Analysed them in real-time with BLAST databases held locally, building trees with a handful of genes.

Later on back in the lab we repeated the sequencing (but not extractions) with Illumina MiSeq, so we could compare the platforms, and also developed a few more sophisticated bioinformatics analyses. To be honest, most of the pipelines we ran could have run in real-time (and now do) but at the time of the main fieldwork we just didn’t expect it would work as well as it did!

Result 1: Genomic DNA sequencing with MinION is fairly easy, even in the middle of nowhere.

Seriously, depending on how much patience and practical skill you have, this is either easy, or really easy. We used the Oxford Nanopore 1D Rapid sequencing kit (disclaimer: actually a prototype ONT provided, though the COTS one is much better now) for sequencing, and extracted using the Qiagen DNeasy Plant Miniprep kit, modified with a longer initial incubation and double-concentrated cleaning step, but essentially unchanged. The MinION itself, as is well documented, runs off USB into a laptop.

Hardware-wise you’ll need:

  • Two waterbaths (or in our case, two polyboxes, a gas kettle, and some thermometers)
  • A centrifuge
  • A generator to run said devices
  • Some poly boxes with -20ºC and ice for reagents

… that’s it. If you’re looking at this list and thinking ‘I could get all that together by next weekend, maybe we should go on a sequencing trip’ well, that’s the idea :)

There are a lot of refinements possible. A portable freezer will make life easier, as will a dedicated 12V supply for USB power and a portable DNA quantification tool like a Quantus. Plus, none of the above really likes rain, so a tent and/or van (as with Nick Loman and Josh Quick’s Zika trip last year) will help out a lot. But to get started, that’s it.

Result 2: Long MinION reads are really good at species ID – even better than Illumina in key respects.

The core goal of this project was to work out “can field-based genomic WGS sequencing identify closely-related species?”. So we deliberately picked two species from the same genus with publicly-available reference genome sequences (A. thaliana and A. lyrata). The ID process would be simple. For each of the four datasets (two MinION runs, one from each species, and two MiSeq 2x300bp paired-end runs, one for each species), we’d simply:

  1. Trim adapters from each read
  2. Match each read to the A. thaliana genome using the best hit with BLASTN
  3. Match each read to the A. lyrata genome, using the same method
  4. Compare the hit scores for each reference genome
  5. Un-blind the read (reveal which of the two species it actually came from)
  6. Score the read as a true or false positive or negative, depending on the result.
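The comparison and scoring steps above can be sketched in a few lines of Python. This is only an illustration, not the code from the paper: the `ReadHits` record, the function names and the exact TP/FP/TN/FN rule are all assumptions, and the best-hit lengths are taken as already parsed from BLASTN output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadHits:
    """Best BLASTN hit lengths (bp) for one read; 0 means no hit (hypothetical record)."""
    true_species: str   # only looked at during the un-blinding step
    len_thaliana: int   # best hit length against the A. thaliana reference
    len_lyrata: int     # best hit length against the A. lyrata reference


def call_species(hits: ReadHits, threshold: int = 1) -> Optional[str]:
    """Call the species whose best hit is longer by at least `threshold` bp, else None."""
    diff = hits.len_thaliana - hits.len_lyrata
    if diff >= threshold:
        return "thaliana"
    if -diff >= threshold:
        return "lyrata"
    return None


def score_read(hits: ReadHits, target: str = "thaliana", threshold: int = 1) -> str:
    """Score one read as TP, FP, TN or FN with respect to `target` (assumed scoring rule)."""
    called_target = call_species(hits, threshold) == target
    is_target = hits.true_species == target
    if called_target:
        return "TP" if is_target else "FP"
    return "FN" if is_target else "TN"
```

Calling `score_read(ReadHits("thaliana", 900, 850))` scores that read at the default 1bp threshold. Raising `threshold` makes the call more conservative, which is the trade-off discussed next.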

Clearly many reads finding a BLAST alignment for one species will also find a significant alignment to the other species, since these are separated by only a couple of Myr (and are pretty similar phenotypically). So if both hits are ‘significant’, how would we distinguish the best one? Intuitively it seems sensible that the longest match / most identities will be better, but what threshold length difference should we use? 1bp longer? 10bp? 100bp?

Happily, a package in R lets you investigate the performance of test statistics on known classifier sets. We used this to produce the plot below, which shows the effect of increasing threshold length difference on true-positive (TP), false-positive (FP) and accuracy rates for MinION (black) and MiSeq (red) reads:

[Figure 2c-2d: TP, FP and accuracy rates against threshold length difference]

The really key thing here is that MinION reads are beating MiSeq ones at most length difference (bias) thresholds greater than ~5-10bp, right up to 300bp (the MiSeq inserts top out at this length, of course). This is important because while here we’re matching orthogonal ID cases (A. lyrata against A. thaliana, and vice versa), in a practical application we might have a third species without a reference but two possible matches, and while some loci will be closer to the first, others could match the second. So while a threshold of 1bp might technically be the best (TP rate of ~90% and close to 100% accuracy), we may want to raise the threshold to a much higher value (>50bp) and accept lower TP rates to get better confidence.
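To make the threshold trade-off concrete, here is a rough Python sketch of how TP rate, FP rate and accuracy change as the length-difference threshold goes up. It is not the R analysis from the paper: the function name and toy numbers are made up, and accuracy is assumed to be the textbook (TP + TN) / total, which may not match the definition used for the figure.

```python
def rates_at_threshold(diffs, is_target, threshold):
    """Return (TP rate, FP rate, accuracy) for one length-difference threshold.

    diffs     -- per read: best hit length to the target reference minus
                 best hit length to the other reference (bp)
    is_target -- per read: True if the read really came from the target species
    """
    tp = fp = tn = fn = 0
    for diff, truth in zip(diffs, is_target):
        called = diff >= threshold  # call 'target' only if it wins by enough bp
        if called and truth:
            tp += 1
        elif called:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return tp_rate, fp_rate, accuracy


# Toy data only: three reads truly from the target, three from the other species
diffs = [120, 40, 3, -60, 8, -200]
is_target = [True, True, True, False, False, False]
for threshold in (1, 10, 50):
    print(threshold, rates_at_threshold(diffs, is_target, threshold))
```

In the toy data, the higher threshold throws away true positives but also gets rid of the false positive, which is the same pattern as accepting a lower TP rate in exchange for confidence.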

Result 3: Species ID does not need complete genomes for reference databases, and even works with a handful of MinION reads.

One very obvious and sensible criticism of our early drafts of this was that the reference genomes we used to build BLASTN databases with are largely completely assembled. While there’s been some handwringing recently about the structural variation of these plant genomes at population level, most people accept that a high proportion of the informative sequences in these genomes are now well determined.

For most people, in most places, this will not be the case; for instance, there are ~300,000-400,000 plant species described, but only ~180-250 public genomes. Most of those are of the fairly-low-coverage HTS WGS variety as well, so are pretty bittily assembled. Quite often these come from first-year baselining experiments in the ‘get some DNA and run it through the MiSeq then SOAP or Abyss’ mould, with N50 values in the low ‘000s.
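If you haven’t met N50 before: it’s the contig length at which, adding up contigs from longest to shortest, you first reach half of the total assembly length. A minimal Python version (the function name is mine) looks like this:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty assembly)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the total length is covered
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

So an assembly with an N50 in the low thousands has half of its bases sitting in contigs only a few kbp long, or shorter.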

So to test the effect of this, we artificially digested the reference genomes a few thousand times to simulate N50 values from about 100 (virtually unassembled) up to 10^6 (essentially complete), shown here in Fig 3a for N50 values from 10^0 up to 10^4, with accuracy scores calculated for a range of cutoff values:


These results were pretty promising, so finally we asked ourselves: OK, we had tens of thousands of MinION reads to make our ID with, generated over a day or so, but how few reads would we need to have a stab at a correct ID? Again, we jackknifed our dataset to find out, shown above in Figs 3b and 3c. Promisingly, you can see that by about 10^2-10^3 reads (in practice, an hour or less) the confidence intervals on our ID score barely budge. So, after an hour of sequencing, you’re likely to get as good an answer as you can get. One. Hour…!!!
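Both resampling steps (chopping up the references, and subsampling the reads) are simple to mock up. The Python below is a sketch under my own assumptions, not the pipeline we ran: `digest` cuts at uniformly random positions, and `subsample_scores` assumes each read already has a numeric ID score (a hypothetical per-read value) and draws subsets without replacement.

```python
import random


def digest(sequence, n_cuts, rng):
    """Cut `sequence` at `n_cuts` distinct random positions and return the fragments."""
    # Cannot make more cuts than there are internal positions
    n_cuts = max(0, min(n_cuts, len(sequence) - 1))
    cuts = sorted(rng.sample(range(1, len(sequence)), n_cuts))
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[start:end] for start, end in zip(bounds, bounds[1:])]


def subsample_scores(read_scores, n_reads, n_replicates, rng):
    """Mean score for each of `n_replicates` random subsets of `n_reads` reads."""
    means = []
    for _ in range(n_replicates):
        subset = rng.sample(read_scores, n_reads)  # sampled without replacement
        means.append(sum(subset) / n_reads)
    return means


rng = random.Random(42)                      # fixed seed so runs are repeatable
fragments = digest("ACGT" * 250, 20, rng)    # toy 1 kbp 'genome' cut 20 times
spread = subsample_scores([1, 0, 1, 1] * 250, 100, 50, rng)
print(len(fragments), min(spread), max(spread))
```

Repeating `subsample_scores` for increasing `n_reads` and watching the spread of the means shrink is the same idea as the confidence intervals settling down.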

Result 4: Field-sequenced WGS MinION long-reads substantially improve downstream genomics with low-coverage HTS data.

When planning the fieldwork we hadn’t really known what we’d get, in terms of read length, quality, or yield: this was a prototype kit that not many people had played with yet, let alone taken into the field. But about the time we were writing this up we found out about various genome assemblers optimised (supposedly) for ONT reads, chiefly Canu and hybridSPAdes. We decided to give it a whirl... the results are pretty amazing!

Data | MiSeq only | MiSeq + MinION
Assembler | Abyss | hybridSPAdes
Illumina reads, 300bp paired-end | 8,033,488 | 8,033,488
Illumina data (yield) | 2,418 Mbp | 2,418 Mbp
MinION reads, R7.3 + R9 kits, N50 ~ 4,410bp | - |
MinION data (yield) | - | 240 Mbp
Approx. coverage | 19.49x | 19.49x + 2.01x

Assembly key statistics:
# contigs | 24,999 | 10,644
Longest contig | 90 Kbp | 414 Kbp
N50 contiguity | 7,853 bp | 48,730 bp
Fraction of reference genome (%) | 82 | 88
Largest alignment | 76,935 bp | 264,039 bp

Errors, per 100 kbp:
# N’s | 1.7 | 5.4
# mismatches | 518 | 588
# indels | 120 | 130

CEGMA gene completeness estimate:
# genes | 219 of 248 | 245 of 248
% genes | 88% | 99%

Result 5: Individual MinION reads can be directly, individually annotated for coding loci with no assembly required.

By now everyone was getting a bit sick of me going on about MinION reads, but there was one final hunch I wanted to test: If reads are about the same length as nuclear coding loci (~5000-50,000bp), does that mean we can annotate individual reads to pull out coding sequences, and use them to build phylogenies? SNAP was a great tool for this, not least because it’s trained on A. thaliana gene models already.

I want to be absolutely clear here, as sometimes people seem to miss this: I’m not talking about assembling reads before annotation as usual. I’m not even talking about assembling them in real-time, then annotating. I mean, each time a read finishes basecalling, immediately try and annotate that single read, and only that single read, to try and get a coding sequence. 

In other words, how quickly can we turn a tube of DNA, into a folder of sequence reads, into alignments of coding loci? The answer is, ‘bloody quickly’:


The dashed line shows ‘all gene models’. The solid line shows unique CDS. The y-axis is the number of predicted CDS and yes, those are thousands – recall that the total number of CDS for A. thaliana is only about 23,000. The x-axis is, well, actual sequencing time (!)*

Now, not all of these are complete genes, and error rate means distinguishing paralogs robustly in a real case (e.g. completely novel genome) would be tricky, but on the other hand, this was a completely unoptimised pipeline, really just hacked together over a couple of weeks. There’s a lot of scope to improve this…
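To show what ‘annotate each read on its own, as it arrives’ means in practice, here’s a toy Python sketch. Big caveat: the annotation step is a deliberately naive stand-in (a forward-strand ATG-to-stop ORF finder via a regular expression), not SNAP and not our pipeline, and all the names are made up. The bit that matters is the shape of the loop: one read in, zero or more candidate CDS out, running tally updated, no assembly anywhere.

```python
import re

# Naive stand-in for a gene predictor: ATG, whole codons, then the first
# in-frame stop codon. Forward strand only, no introns, no gene model.
ORF = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")


def annotate_read(read, min_len=90):
    """Return candidate coding sequences found in one single read."""
    candidates = (match.group() for match in ORF.finditer(read.upper()))
    return [cds for cds in candidates if len(cds) >= min_len]


def stream_annotations(reads, min_len=90):
    """Yield (reads_seen, unique_cds_so_far) after every read, with no assembly step."""
    unique_cds = set()
    for reads_seen, read in enumerate(reads, start=1):
        unique_cds.update(annotate_read(read, min_len))
        yield reads_seen, len(unique_cds)
```

Feed `stream_annotations` an iterator that yields reads as they finish basecalling and you get a growing count of unique candidate CDS against reads seen, which is roughly what the plot above shows against time.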

*These are the read timestamps, but it wasn’t a live run. I actually ran the analysis back in the lab afterwards, as my code was too buggy on the fieldwork day and I just lost my shit. But the CPU demands aren’t high – I can and have run this live subsequently.


Result 6: Predicted coding loci from individual MinION reads can be aligned to orthologous sequences, and multilocus phylogeny inferred in real-time.

I build trees. I build trees. I build trees for a living. Did you seriously think, that having got as far as spitting out thousands of novel coding loci per day in field-based sequencing, I wasn’t going to try some real-time, field-based, multilocus phylogenomic inference?


Here’s a *BEAST tree from 53 loci, all of the A. thaliana ones coming from directly-annotated, field-sequenced reads. Seriously 😉


As you’ve gathered by now, I’m enormously happy with this research. I think this paper is easily a bigger contribution to general science than our 2013 molecular convergence one because, if we’re honest, molecular convergence is an interesting but niche phenomenon, whereas literally everyone who uses or categorises any kind of biological material can benefit from this paper.

I’m indebted to my colleagues, Alex S. T. Papadopulos, Dion Devey, Andrew Helmstetter and Tim Wilkinson, and our funders, the Kew Foundation. We didn’t invent the MinION – that took hundreds of incredibly clever people at Oxford Nanopore years and millions of pounds of investment to do – but in this study we’ve managed to show all of the really transformative aspects of this technology working in the field, in real-time. There is no technical reason at all why we shouldn’t all expect that within a decade, all of the analyses we currently run on DNA data in labs can run in the field, within minutes of collecting biological samples. And that really is something.

Posted in Science