Category Archives: Science

Phylogenomic convergence detection: lessons and perspectives

Talk presented at the 18th Evolutionary Biology Meeting At Marseille (programme), 16th-19th September 2014.

(Powerpoint – note this is a draft, not the final talk, pending authorisation): EBMdraft


Migrating to OS X Mavericks

The time has come, my friends. I am upgrading from 10.6.8 (‘Snow Leopard’) to 10.9 (‘Mavericks’) on my venerable and mistreated MacBook Pros (one a 2010 model with a SATA drive, the other a 2011 with an SSD). Common opinion holds that the 2010 machine might find it a stretch, so I’m starting with the 2011/SSD model first. Also, hey, it’s a work machine, so if I truly bork it, Apple Care should (should) cover me…

Availability

At least Apple make the upgrade easy enough to get: for the last year or so, Software Update has been practically begging me to install the App Store. Apple offer OS X 10.9 for free through this platform (yes! FREE!!), so it’s a couple of clicks to download and start the installer…

Preamble

Obviously I’ve backed up everything several times: to Time Machine, on an external HDD; to Dropbox; Drobo; and even the odd USB stick lying around as well as my 2010 MBP and various other machines I have access to. As well as all this, I’ve actually tried to empty the boot disk a bit to make space – unusually RTFM for me – and managed to get the usage down to about 65% available space. I’ve also written down every password and username I have, obviously on bombay mix-flavoured rice-paper so I can eat them after when everything (hopefully) works.

Installation

Click the installer. Agree to a few T&Cs (okay, several, but this is Apple we’re talking about). Hit ‘Restart’. Pray…

Results

… And we’re done! That was surprisingly painless. The whole process took less than two hours on my office connection, from download to first login. There was a momentary heart attack when the first reboot appeared to have failed and I had to nudge it along, but so far (couple of days) everything seems to be running along nicely.

Now, I had worried (not unreasonably, given previous updates) that my computer might slow down massively, or blow up altogether. So far this doesn’t seem to have happened. The biggest downsides are the ones I’d previously read about, so not unexpected: e.g. PowerPC applications like TreeEdit and Se-Al aren’t supported any more. Apparently the main workaround for this is a 10.6.8 Server install inside Parallels, but I’ll look into this more in a future post when I get a chance.

I was a bit surprised to find that both Homebrew and, even more oddly, my SQL installation needed to be reinstalled, but a host of other binaries didn’t. Presumably there’s a reason for this, but I can’t find it. Luckily those two at least reinstall pretty painlessly, but it did make me grateful nothing else broke (yet).
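If you hit the same Homebrew breakage, the standard checks are worth running before anything more drastic – a minimal sketch, not necessarily the exact sequence I used (Mavericks ships without the command-line developer tools, hence the first line):

xcode-select --install    # reinstall the command-line developer tools under 10.9
brew doctor               # let Homebrew list anything it thinks is broken
brew update               # refresh the formulae before reinstalling packages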

So what are the good sides? The general UI is shiny – not that this matters much in a bioinformatics context – and smart widgets like Notifications are pretty, but to be honest there aren’t any really compelling reasons to switch. I’ve not used this machine as a laptop much so far, so I can’t comment on the power usage (e.g. stuff like App Nap) yet, although it seems to be improved… a bit… and I haven’t had time to run any BEAST benchmarks to see how the JVM implementation compares. But there is one massive benefit: this is an OS Apple are still supporting! This matters because security and firmware updates really do matter, a lot – and release cycles are getting ever shorter, especially as Macs get targeted more. In short: I couldn’t afford to stay behind any longer!

Update [5 Oct 2014]: Given that the Shellshock bash exploit affects both 10.6 and 10.9, but Apple aren’t – as yet – releasing a patch for 10.6, while they rushed out a 1.0 patch for 10.9 in less than a week, the security aspect of this upgrade is even more clearly important…

Update [23 Oct 2014]: Nope, I won’t be upgrading to Yosemite for a while, either!

Installing Ubuntu 12.04 LTS on a Dell Windows 8 machine with UEFI

Quick post, this – originally on evolve.sbcs, but behind a wall there, so reposted here – both in case anyone’s interested and in case something breaks and we have to remember what we did, exactly (the far more likely scenario).

Ubuntu. Everyone loves it – probably the world’s easiest dual-boot install going and you get a lot of features out of the box as well. So, for the Windows user who needs Linux features / reliability / configurations but wants to keep a copy of Windows on their system for Office or just build testing, Ubuntu is a great solution.

Well, at least, it used to be. Then Microsoft went and released Windows 8. Quite apart from being an even worse resource hog than W7, it also introduced the horror that is UEFI, and a whole load of hell was unleashed. That’s because UEFI – a seemingly harmless set of ‘improvements’ to bootloader management (the bit of firmware on your motherboard which locates the OS, introduces it to RAM and disk controllers, and generally makes everything you see happen) – is actually a right pain in the arse. Typically for Microsoft, UEFI isn’t just overloaded with features which dilute its core purpose: it actually introduces unpredictable behaviour which makes life difficult for anyone nosing beneath the surface of the operating system (Windows) the machine comes with – not just those trying to dual-boot, although I suspect they were the main targets, as millions of people trying Ubuntu for free and realising what an unstable rip-off Windows 8 is wouldn’t play well with shareholders…

Which is all a ranty way to say that the lovable Ubuntu install process I’ve used for pretty much a decade on Windows 7, XP, Vista and even 98 machines – all of which happily budge up a bit to make room for a dual-boot Ubuntu installation – has been well and truly borked by Windows 8 and its crappy UEFI lock-in. In particular, Hernani, one of our new students, has been issued a shiny new Dell (good for him) with Windows 8 loaded via UEFI (not so good). The install process was markedly more complicated than on other Windows / BIOS versions, and while this is a common problem, I couldn’t find a single tutorial online that addressed our specific situation – so here you go (PS: I / the lab / QMUL disclaim all liability, express or implied, should you attempt the steps below and suffer…):

  • Create an Ubuntu Live / bootable install USB
  • Shrink your Windows partition
  • Back up your files and make a Windows Restore CD / USB
  • Access the UEFI bootloader
  • Install Ubuntu
  • Re-activate UEFI
  • Check GRUB
  • Install boot-repair
  • Check GRUB
  • Keep backing up your files!

Create an Ubuntu Live / bootable install USB (or CD)

First we need to create a CD or a USB which we’ll use to install Ubuntu (and/or run a ‘live’ session, where you run the OS directly off the disk without installing to the hard drive. This lets you test Ubuntu and verify it works on your machine). This is a very well covered topic, not least from Ubuntu.com themselves, but I’ll just note that we installed Ubuntu 12.04 LTS (long-term stable), 64-bit release onto a USB – the machine in question having no optical drive. We used Unetbootin for this.
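If you already have a working Linux box to hand, plain dd is an alternative to Unetbootin – a minimal sketch, where the ISO name and /dev/sdX are placeholders (triple-check the device letter, because dd will cheerfully overwrite the wrong disk):

# write the installer image straight to the USB stick (unmount it first)
sudo dd if=ubuntu-12.04-desktop-amd64.iso of=/dev/sdX bs=4M
sync    # make sure the last blocks are flushed before pulling the stick out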

Shrink your Windows partition to create space on the hard drive for Ubuntu

To install Ubuntu we’ll need at least 100GB of free disk space. How much of the available space you want to give to Ubuntu is up to you, and probably depends on whether you plan to use your machine mainly in Windows 8 or Ubuntu; Ubuntu can see the Windows partitions, but not the other way round. To do this we need to shrink the existing Windows partition and repartition the freed space. This subject is well covered in this guide (which we started with), but the subsequent steps gave us more trouble (hence this post)…

Back up your files and make a Windows Restore CD / USB

There is a fairly high chance that this will result in a fried filesystem, and you may not even be able to get Windows to boot. You might also lose data if you screw up the disk partitions. So back up all your files. This is very important – not doing so practically guarantees the jinns of Murphy will screw up your hard drive for you. I don’t care how you do this, just do it.

Equally, since we might bork Windows well-and-truly (possibly a good thing but it would be nice to do so on purpose rather than accidentally) it makes sense to burn a Windows Recovery CD (or USB) at the very least. Again, you can do this easily and there’s lots of tutorials out there, so I won’t deal with it here, except to say you’ll need this to be on a separate USB/CD to the Ubuntu one we just made.

Access the UEFI layer and switch to legacy BIOS

The operating system itself (Windows, OS X, Linux etc.) doesn’t actually run straight away when you start the computer. This is, in fact, why you can install / upgrade / change / dual-boot various different OSes. Instead, a small piece of software hardcoded onto your motherboard runs when you hit the power button to start the machine. This is the black screen you glimpse as the machine reboots. Most commonly this software (‘firmware’ in fact, since it is rarely changed or updated) is called BIOS, but a new standard called EFI has become more common in recent years (Windows’ own flavour is the hellish UEFI). Its main job is to find all the various devices (monitor, keyboard, hard disks, USB ports), introduce them to the CPU and RAM, and then hand the whole lot over to an operating system on a disk somewhere to make the good stuff happen. Note that I said ‘an‘ operating system, located somewhere on ‘a’ disc – there’s no reason why Windows should be picked, or why the disc should be the hard drive and not… say… an Ubuntu install USB! 🙂

So, to access this magical BIOS-y, UEFI-y world where we can tinker about with the default OS, we need to shut the computer down and then restart it; on most personal computers until now, hitting a setup key (usually something like F2 or F12) repeatedly on restart would interrupt the normal boot process and bring up a special setup screen. Unfortunately, Windows 8 has a ‘feature’ (read: ‘restriction’) built in called Secure Boot which normally prevents you from doing this insanely simple procedure. Instead there’s a vastly more complicated process, which is luckily well explained in this SO thread. Scroll down to ‘Windows 8 + Ubuntu‘ to do this.

Once you have rebooted into the BIOS/UEFI setup, the version that shipped with this Dell (an OptiPlex 9020 ultra-compact) has a screen with an ‘advanced boot/UEFI options’ submenu and a ‘secure boot’ submenu. Enter the secure boot menu, disable secure boot and click ‘apply’ to save changes, then enter the UEFI screen. The most important option on this screen selects between bootloader modes: ‘UEFI’ and ‘legacy’ (which means BIOS). We’ll need the legacy / BIOS mode to install Ubuntu from USB, so select this. The previous set of UEFI options will be replaced by a list of devices. This is the main difference from our point of view – in BIOS mode the computer will try each device for a bootloader file, in the order in which they appear, and the first one found will be run. EFI, on the other hand, is able to look non-sequentially on a device (e.g. a disc) for a specific bootloader. However: we just want to get on with our Ubuntu installation, so make sure the device with your Ubuntu installer (USB or CD) is listed first, click ‘apply’ then ‘exit’ (making sure said media is actually inserted!)

Install Ubuntu

The process to install Ubuntu itself is actually simple enough, and covered in truly exhaustive (some might say ‘tedious’) detail elsewhere – but there are a couple of installation points we need to note at this stage for our specific application. Again, if you’re doing a complete wipe of Windows you can just do a complete erase and you’ll be fine:

  • Select ‘something else’  in the main install options – do not erase Windows if you want to dual-boot.
  • There’s probably not much point in installing the updates and third-party stuff at this stage – I had to reinstall Ubuntu a few times by trial and error, and the extra packages take a while to download, unpack, and install. Anyway, you can get them later on when the installation’s debugged.
  • The order and location of your partitions matters, a lot. This process is covered in more detail in this SO thread, but to summarise (you should read the whole SO post though):
    • You’ll need 50GB (50,000MB) for the Ubuntu OS itself, formatted as ext4 and mounted as root ‘/’. On our Dell this is /dev/sda7
    • You’ll need 16GB (15,999MB) for swap space (Windows users will know this as ‘virtual RAM’). This is technically optional, but without it the system is very likely to crash frequently when RAM runs short.
    • Don’t touch the existing (Windows) partitions
    • You’ll want to partition the remaining disk space as ext4 (probably) and mount it at ‘/home’
  • Make a note of the password you select for Ubuntu! Otherwise you’ll have to reinstall…
  • Finally – and this is important – although other guides say the bootloader should be written to:
    /dev/sda

    we found this didn’t work with our Dell UEFI. Instead we had to install the bootloader to:

    /dev/sda1

    which worked fine.
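Once the installer has finished, it’s worth sanity-checking that the partition table actually looks the way you intended before you go back into the UEFI settings – a quick sketch (the device names follow the Dell layout described above; yours may differ):

sudo parted /dev/sda print    # list partitions, sizes and file systems on the main disk
lsblk -f                      # show file systems and mount points for every block device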

Re-activate UEFI

With the install process complete, we now need to switch UEFI back on; this will continue to be the main way the bootloader gets called from now on, but hopefully we’ll be using GRUB (a Linux bootloader) instead of the Windows bootloader. GRUB allows you to pick which OS to boot every time you restart the computer (it defaults to Ubuntu, but you can choose Windows 8 if you want – it won’t care!) and should be configured automatically. However, we’ll need to turn UEFI back on first: shut the computer down, wait 5-10 seconds, then restart it, hitting F2 (or your machine’s BIOS / UEFI setup hotkey) as soon as you hit the power button to access the UEFI / BIOS screen again.

From the BIOS / UEFI, find the advanced boot options, deselect Legacy mode (BIOS) and reselect UEFI. This will put us in UEFI mode again, so we have to configure the UEFI options to choose the GRUB bootloader. You should see that an ‘ubuntu’ option has appeared in the UEFI list as well as the Windows boot manager. Select this, then click ‘view’ to see which bootloader file it points to.

In our installation, ‘ubuntu’ pointed to (filesystem FS0): /EFI/boot/shimx64.efi. This will simply load ubuntu directly. If you only plan to use Ubuntu, you can stop here, as this option will find Ubuntu every time you reboot. However, we wanted to use GRUB to pick Ubuntu or Windows, so there’s an additional step here: Click ‘add boot option’ (or ‘path’, or similar) to create a new UEFI boot option. Give it a name (we went for ‘GRUB’, logically enough). Then we need to pick the bootloader file itself – this is the file the UEFI will try and find, which will then handle the OS selection and initialisation. In our case, the file we were looking for was (again on filesystem ‘FS0’): /EFI/boot/grubx64.efi. This is the file to load GRUB, but while you’re poking about in here you may also see a directory for Windows and the bootloaders (there’s a few) associated with it. Select this file, save the UEFI option and make sure this is the first boot option in the list (use arrows to move it up) and/or uncheck (don’t delete…) the other options (‘Windows boot manager’ and ‘ubuntu’ probably).

Click ‘apply’ then ‘exit’ and the machine will reboot…
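As an aside, once the machine is booting in UEFI mode you can also inspect and reorder these entries from within Ubuntu using efibootmgr, rather than rebooting into the firmware screen every time – a minimal sketch, where the entry numbers are examples only (check the output of the first command for yours):

sudo efibootmgr -v              # list the firmware boot entries, including the installer-created 'ubuntu' one
sudo efibootmgr -o 0003,0000    # example: put entry 0003 (e.g. GRUB) ahead of 0000 (e.g. the Windows boot manager)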

 

Check GRUB

If we were installing next to Windows 7, Vista, XP, NT, 2000, 98 – or almost any other OS – we’d be able to have a cup of tea and a celebratory hobnob at this point: after rebooting, the friendly GRUB boot selection screen would pop up and let us select Ubuntu, Ubuntu recovery mode, memtest or indeed Windows from a simple menu. We found that GRUB loaded OK, and all the linux-y options came up, but it couldn’t show the Windows boot option. Damn you, Windows 8 – in W7 this is no problem at all. We also wouldn’t have had to prat about with UEFI settings either, but hey, Windows 8 is shiny and looks a bit like iOS, so it must be better, right…

A fair few people have actually managed to get GRUB to function correctly at this stage on W8, but then they’re not reading this blog, are they? 😉

Install boot-repair

If the GRUB bootloader works (that is, you see the GRUB selection screen) but the Windows installation you irrationally cherish so much isn’t shown, you’ll need to edit the GRUB config files to include the location of the Windows bootloader files, so that GRUB can display them as an option for you. You can sleuth about the hard drive to find both the GRUB config and the Windows .efi files, but luckily there’s a handy script from YannUbuntu called Boot-Repair which does this for you. I followed instructions from the SO thread we used before (here, section ‘REPAIRING THE BOOT’). In their words, “In Ubuntu after it boots, open a Terminal and type the following:


sudo add-apt-repository ppa:yannubuntu/boot-repair  
sudo apt-get update
sudo apt-get install boot-repair

“Now run:”

boot-repair

That will bring up the boot-repair script (it even has a handy GUI). This is pretty easy to use; the ‘recommended repair’ will probably fix things for you, assuming you’ve set everything else up OK.
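A quick way to confirm the result from inside Ubuntu, before rebooting, is to regenerate the GRUB menu and grep it for a Windows entry – a sketch (os-prober, which update-grub calls, is what actually detects the Windows bootloader):

sudo update-grub                       # rebuild /boot/grub/grub.cfg; os-prober should report any Windows install it finds
grep -i windows /boot/grub/grub.cfg    # confirm a Windows entry made it into the menu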

Check GRUB again

Reboot. Pray. Sorry, but ours was working after this step, so you’re on your own if you haven’t had much luck by now. The first thing I would do if this failed would be to check the BIOS was set to UEFI, and that the GRUB option you edited with boot-repair was active (and first) in the list.

Keep backing up your files!

The forums seem to suggest there’s a chance your system will randomly fall over in the next few weeks. So while you should back up your files in general anyway – especially if you’re working with RCUK data – it pays to be extra-vigilant for the next month.
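For what it’s worth, a one-liner rsync to an external drive is the easiest backup habit to keep up during that period – a minimal sketch, with the destination mount point as a placeholder:

# mirror the home directory to an external drive, preserving times and permissions;
# drop --delete if you'd rather the backup never removed anything
rsync -avh --delete ~/ /media/backup-drive/home-backup/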

 

That’s all.. good luck!

 

Reviewing FastCodeML, a potential replacement for PAML’s codeml

(first posted at SBCS-Evolve)

Valle and colleagues have just released FastCodeML, an application for the detection of positive selection in DNA datasets. It is a successor to SlimCodeML, which is itself an improved implementation of the codeml program from Ziheng Yang’s workhorse phylogenetic analysis package PAML – a package which continues to be extremely popular, has moved through several versions, and has at long last acquired a GUI. Note ‘analysis’ – these programs are not intended for phylogeny inference (finding phylogenetic trees from data), although they will do a rough search if required. In fact, for many people the main program of choice for phylogeny inference itself is RAxML, which is produced by the same lab (Stamatakis). So, given this lab’s pedigree, I was interested to see how FastCodeML compares to codeml, the implementation the authors seek to replace.

Installation

To install FastCodeML you’ll find a set of tar’ed distributions at their FTP site, ftp://ftp.vital-it.ch/tools/FastCodeML/. It wasn’t immediately clear which is the most recent build, so I downloaded both a June 2013 version and the 1.01 release just in case. These include a precompiled binary for Linux (we’re on CentOS for our dev cluster), and as I was feeling lazy – sorry, keen to test the performance of the un-optimised binary that ships with the package – I used this. However, a simple configure / make / make check / make install will do the job as well. I had to fiddle with a couple of permissions, but that’s all stuff unique to our dev cluster and common to everything we install, so it’s hardly fair to penalise FastCodeML for that…
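For the record, the from-source route is the usual configure-and-make dance – a sketch only, since the exact tarball name depends on which release you grab from the FTP site, and you may want a different install prefix:

tar xzf FastCodeML-1.01.tar.gz    # placeholder name: unpack whichever release you downloaded
cd FastCodeML-1.01
./configure                       # add --prefix=$HOME/local if /usr/local isn't writable
make
make check                        # run the bundled tests
sudo make install                 # or plain 'make install' with a user-writable prefix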

Interface

FastCodeML, like RAxML, uses arguments passed to the command line at runtime, rather than a GUI (like, say, HYPHY or Garli) or a configuration script (like PAML or BEAST). As a result (assuming you’ve added the FastCodeML executable, fast, to your $PATH) you can paste and go, e.g.:

fast -m 22 -bf -hy 1 ./<tree-file> <sequence-file>

I’m ambivalent about passing arguments on the command line for phylogenetic analyses. Let’s look at the pros and cons: a GUI is more attractive to many users, with the important caveat that it must be well designed, otherwise usability ends up traded off against performance and flexibility. A lot of phylogeneticists are used to working without GUIs and (more importantly) are crunching large genomic datasets these days, so the lack of a GUI isn’t really a problem.

What about config files, the approach taken by the original PAML package? Well, they are fiddly at times, but there are a number of advantages. Firstly, the config file itself is a useful aide-memoire, and default parameter values can be suggested. Secondly, and perhaps most importantly in a phylogenomic context, the config file acts as a record of the analysis settings. It can be easier to debug pipelines with config files as well. However, a drawback of config files is their size, and in a pipeline context relying on another file existing where and when you need it can be risky. Passing arguments on the other hand avoids this, and also allows you to work more efficiently with pipes.

About the most significant drawback of the pass-by-argument approach is that as the number of parameters/options increases, keeping track of your own intentions becomes much harder. For example, running a BEAST analysis using arguments would be virtually impossible. In our pipeline API I coded a wrapper for RAxML where the various options could be laid out clearly in the comments, and various operation modes defined as enumerated types; but despite years of using RAxML I still forget certain command-line options from time-to-time if running single analyses. Overall, though, for simple applications like RAxML, MUSCLE or FastCodeML, I think I prefer command-line arguments as simpler and cleaner. Hopefully feature creep won’t bloat the option set of future releases.
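In the same spirit, the cheapest insurance in a pipeline is a thin wrapper script that keeps the chosen options, and a comment on each, in one place – a hypothetical sketch around the fast invocation shown earlier (the script name and layout are mine, not part of FastCodeML):

#!/usr/bin/env bash
# run_fast.sh – record the FastCodeML options used by the pipeline in one commented place.
# Usage: ./run_fast.sh <tree-file> <sequence-file>
set -euo pipefail

TREE="$1"
SEQS="$2"

# Options copied from the example invocation above; check the FastCodeML
# documentation before changing any of them.
OPTS="-m 22 -bf -hy 1"

fast $OPTS "$TREE" "$SEQS"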

Output

The output from FastCodeML is pleasingly clean and simple:

Doing branch 0

LnL1: -2393.120011398291808 Function calls: 1259

  0.1982869  0.1727072  0.2111106  0.0741396  0.0443093  0.0160066  0.0102474  0.0050577
  0.0084463  0.0254361  0.1328334  0.0754612

p0:  0.4817193  p1:  0.5182807  p2a:  0.0000000  p2b:  0.0000000
w0:  0.0000010  k:   5.1561766  w2:   1.1258227

This contrasts with several screens’ worth of ramblings from PAML, on both stdout and output streams (even with verbose disabled). To get the same data from PAML you have to dig about a bit in the stdout or results file, or else look at rst and find:

dN/dS (w) for site classes (K=3)

site class             0         1         2
proportion        0.55182   0.32889   0.11929
branch type 0:    0.05271   1.00000   3.43464
branch type 1:    0.05271   1.00000   0.00000

The two methods are broadly in agreement, to a first approximation: both the proportions and omegas (dN/dS) estimated are fairly similar. I wasn’t too worried about the numerical value of the estimates here, as there are subtle differences in the implementation of both.

But what about their speed? This is the major selling-point of FastCodeML, after all; there are many competing methods to detect natural selection, but the pedigree of Stamatakis and colleagues promises a substantial improvement over PAML. There are two considerations here: wall-clock time, the time the calculation appears to take to the user (most relevant to single local analyses); and CPU time, which excludes I/O and time the thread spends sleeping or suspended – for cluster work the latter is far more important, since most jobs will be scheduled and CPU time rationed.
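The real/user/sys breakdown below is the kind of thing the shell’s time builtin prints, and wrapping each run in it is the simplest way to get both numbers at once – a sketch, with the input file names as placeholders (codeml reads its settings from a control file, while FastCodeML takes the tree and alignment as arguments):

time codeml codeml.ctl                                   # time the PAML run
time fast -m 22 -bf -hy 1 ./tree-file ./sequence-file    # time the equivalent FastCodeML run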

Well, according to POSIX::time(), PAML does fairly well:

real	0m50.872s
user	0m50.781s
sys	0m00.006s

But what about FastCodeML?

real	0m05.641s
user	1m22.698s
sys	0m01.158s

Pow. The important thing to note here is what those labels mean: ‘real’ is wall-clock time, while ‘user’ is CPU time summed across all threads. FastCodeML finishes in roughly a ninth of the wall-clock time codeml needs, although it clocks up more total CPU time in doing so because the work is spread across multiple threads. This is extremely promising and suggests that cluster/pipeline operations should be significantly faster using FastCodeML…

 

Conclusions

If I’d been bothered enough to use some sort of boxing analogy throughout this post, I’d be using metaphors like ‘sucker punch’ and ‘knockout blow’ by now. FastCodeML really is fast, and impressively so. It also offers a reduced filesystem footprint and cleaner I/O, which will likely make it ideal for cluster/pipeline work. Assuming the accuracy and specificity of its implementation are at least comparable to PAML (but that’s a much harder thing to test) I’m likely to use FastCodeML for most analyses of natural selection in future.

Using x-windows to render GUIs for server-client phylogenomics

The power of modern computers allied to high-throughput next-generation sequencing will usher in a new era of quantitative biology that will deliver the moon on a stick to every researcher on Earth…

Well, quite.

But something we’ve run up against recently in the lab is actually performing useful analyses on our phylogenomic datasets after the main evolutionary analyses. That is, once you’ve assembled, aligned, built phylogenies and measured interesting things like divergence rates and dN/dS – how do you display and explore the collected set of loci?

I’ll give you an example – suppose we have a question like “what is the correlation between tree length and mean dS rate in our 1,000 loci?” It’s fairly simple to spit the relevant statistics out into text files and then cat them together for analysis in R (or Matlab, but please, for the love of God, not Excel…). And so far, this approach has got a lot of work done for us.
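To make that concrete, the collation really can be as dumb as the following – a hypothetical sketch, where the per-locus file naming and the column names (tree_length, mean_dS) are assumptions:

# stitch one tab-separated stats file per locus into a single table, keeping one header line
head -1 locus_0001.stats > all_loci.tsv
tail -q -n +2 locus_*.stats >> all_loci.tsv

# then a quick correlation in R without leaving the shell
Rscript -e 'd <- read.delim("all_loci.tsv"); print(cor.test(d$tree_length, d$mean_dS))'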


But suppose we wanted to change the question slightly – maybe we notice some pattern when eyeballing a few of the most interesting alignments – and instead ask something like “what is the correlation between tree length and mean dS rate in our 1,000 loci, where polar residues are involved with a sitewise log-likelihood of ≤ 2 units?” Or <3? Or <10? Most of this information is already available at runtime, parsed or calculated. We could tinker with our collation script again, re-run it, and do another round of R analyses; but this is wasteful on a variety of levels. It would be much more useful to interact with the data on the server while it’s in memory. We’ve been starting to look at ways of doing this, and that’s for another post. But for now there’s another question to solve before that becomes useful, namely – how do you execute a GUI on a server but display it on a client?

There are a variety of bioinformatics applications (Geneious; Galaxy; CLCBio) with client-server functionality built in, but in each case we’d have to wrap our analysis in a plugin layer or module in order to use it. Since we’ve already got the analysis in this case, that seems unnecessary, so instead we’re trialling X-windows. This is nothing to do with Microsoft Windows: it’s just a display server which piggybacks an SSH connection, so that a GUI running on the UNIX server is drawn on the client’s display (on OS X, the XQuartz X11 server handles the client end). It sounds simple, and it is!

To set up an x-windowing session, we just have to first set up the connection to the bioinformatics server:

ssh -Y user@host.ac.uk

Now we have a secure x-window connection, our server-based GUI will display on the client if we simply call it (MDK.jar):

User:~ /usr/bin/java -jar ~/Documents/dev-builds/mdk/MDK.jar

Rather pleasingly simple, yes? Now there are a few issues with this, not least a noticeable cursor lag. But – remembering that the aim here is simply to interact with analysis data held in memory at runtime, rather than do any complicated editing – we can probably live with that, for now at least. On to the next task!

(also published at evolve.sbcs)

Our Nature paper! Genome-wide molecular convergence in echolocating mammals

Exciting news from the lab this week… we’ve published in one of the leading journals, Nature!!!

Much of my work in the Rossiter BatLab for the last couple of years has centred around the search for genomic signatures of molecular convergence. This means looking for similar genetic changes in otherwise unrelated organisms. We’d normally expect unrelated organisms to differ considerably in their genetic sequences, because over time random mutations occur in their genomes; the more time has passed since two species diverged, the more changes we expect. However, we know that similar structures may evolve in unrelated species due to shared selection pressures (think of the streamlined body shapes of sharks, ichthyosaurs and dolphins, for example). Can these pressures produce identical changes right down at the level of genetic sequences? We hoped to detect identical genetic changes in unrelated species (in this case, echolocation – ‘sonar hearing’ – in some species of bats and whales) caused by similar selection pressures operating on the evolution of the genes required for those traits.

It’s been a long slog – we’ve had to write a complicated computer program to look at millions of letters of DNA – but this week it all bears fruit. We found that a staggering number of genes in the genomes of echolocating bats and whales (a bottlenose dolphin, if you must) showed evidence of these similar genetic changes, known technically as ‘genetic convergence’.

Obviously we started jumping up and down when we found this, and because we imagined other scientists would too, we wrote up our findings and sent them to the journal Nature, one of the top journals in the world of science… and crossed our fingers.

Well, today we can finally reveal that we were able to get through the peer-review process (where anonymous experts scrutinise your working – a bit like an MOT for your experiments), and the paper is out today!

But what do we actually say? Well:
“Evolution is typically thought to proceed through divergence of genes, proteins and ultimately phenotypes. However, similar traits might also evolve convergently in unrelated taxa owing to similar selection pressures. Adaptive phenotypic convergence is widespread in nature, and recent results from several genes have suggested that this phenomenon is powerful enough to also drive recurrent evolution at the sequence level. Where homoplasious substitutions do occur these have long been considered the result of neutral processes. However, recent studies have demonstrated that adaptive convergent sequence evolution can be detected in vertebrates using statistical methods that model parallel evolution, although the extent to which sequence convergence between genera occurs across genomes is unknown. Here we analyse genomic sequence data in mammals that have independently evolved echolocation and show that convergence is not a rare process restricted to several loci but is instead widespread, continuously distributed and commonly driven by natural selection acting on a small number of sites per locus. Systematic analyses of convergent sequence evolution in 805,053 amino acids within 2,326 orthologous coding gene sequences compared across 22 mammals (including four newly sequenced bat genomes) revealed signatures consistent with convergence in nearly 200 loci. Strong and significant support for convergence among bats and the bottlenose dolphin was seen in numerous genes linked to hearing or deafness, consistent with an involvement in echolocation. Unexpectedly, we also found convergence in many genes linked to vision: the convergent signal of many sensory genes was robustly correlated with the strength of natural selection. This first attempt to detect genome-wide convergent sequence evolution across divergent taxa reveals the phenomenon to be much more pervasive than previously recognized.”
Congrats to Steve, Georgia and Joe! After a few deserved beers we’ll have our work cut out to pick through all these genes and work out exactly what all of them do (guessing the genes’ biological functions, especially in non-model (read: not us or things we eat) organisms like bats and dolphins, is notoriously tricky). So we’ll probably stick our heads out of the lab again in another two years…

The full citation is: Parker, J., Tsagkogeorga, G., Cotton, J.A.C., Liu, R., Stupka, E., Provero, P. & Rossiter, S.J. (2013) Genome-wide signatures of convergent evolution in echolocating mammals. Nature (epub ahead of print), 4th September 2013. doi:10.1038/nature12511. This work was funded by Biotechnology and Biological Sciences Research Council (UK) grant BB/H017178/1.


Take care with those units…

Recent furore over Rogoff & Reinhart’s discredited research on the effects of high national debt:GDP ratios led me to consider some differences between economics and the natural sciences… If you’re familiar with the arguments, my perspective – as a scientist – is nearer the bottom.

Metres are not feet. Seconds are not years. Take care with your calculations, and always compare them against what you expect. Data are noisy, so repeat significant results and check, check, check your working. It’s the first lesson every scientist learns, reinforced through years of training. Because natural science (literally ‘knowledge’, eh kids?) is essentially an attempt to discover and convey truths about how the universe works, scientists have an almost pathological fear of mistakes. Results are checked and re-checked. Journal articles are scrutinised by expert reviewers and editors, and errata swiftly published. And scientists tread with particular care where their results impinge on controversial phenomena with an impact on human society, such as genetics or climate science.

Loch Lomond: Hard to mistake for a strictly deterministic phenomenon. Picture: Wikimedia

There are exceptions.  Few professional scientists (including me) would confidently claim their published work is entirely error-free. Nonetheless, just as aero engineers worry about ‘unmodelled’ faults (those they have not anticipated) more than modelled ones (the ones they expect and can mitigate for), the general attitude amongst scientists is a fairly healthy level of conscious vigilance combined with a realistic acceptance that mistakes happen. Maybe we are too cautious when it comes to engaging with the policy implications of our research; maybe not.

Does economics – that wide-ranging and fascinating discipline, practiced by a range of actors from academics, to politicians and pundits, to bankers and businesses – take a similar approach?

I’m not so sure.

A recent episode illustrates a couple of worrying issues. First, the background: In 2009 Carmen Reinhart and Kenneth Rogoff, two Harvard economists, published an analysis of the relationship between national (government) debt and GDP (national wealth) in a dataset of 20 historical examples. Their finding: a ratio of debt:wealth greater than 90% (roughly speaking, owing more than 90p for every £1 earnt) was correlated with a decline in GDP growth from a few percent per year (around ~1-3% GDP growth per year being typical for healthy, robust postindustrial democracies) to -0.1%. A very rough reading of this is: Western countries that owe more than 90% of GDP will experience economic contraction. At the time the US (well, everyone) was gripped by financial panic and, as respected academics, their work attracted substantial attention (this is all much better documented by Heidi Moore among others).

These raw data are numbers – but still as messy as trees, stars, or chemical reactions. Picture: Wikimedia.

Now, every academic is conscious today of the need, where possible, to publish high-profile research, and a few headlines never hurt. These often result in a bit of ‘sexing-up’ when your study is reported by the university press office, and further dumbing-down by the time the story makes it to national news media (how sexing-up becomes dumbing-down is a great mystery). Rogoff & Reinhart went much further than this, and in addition to seeking media appearances, called publicly for the conclusion of their analysis (‘debt is correlated with economic stagnation’) to be turned into policy (‘governments must reduce public debt, or they will cause further recession’.) Their advice gave supporters of fiscal austerity some powerful rhetorical ammunition. Which they unloaded promptly.

This very public stance from a two-person research team drawing a startling conclusion would be unusual even in the charged arenas of public health, pharmaceutical or climate science, where the extreme policy implications mean applied research attracts significant attention and conclusions are regularly distorted. However, the professional researchers involved, by and large, are aware of this – and, as mentioned above, take pains to ensure the robustness of their results accordingly.

Rogoff & Reinhart were unlucky. A reanalysis of their data by peers revealed significant problems with their work, including inappropriate methodology and even (incredibly) simple arithmetical errors (apparently they actually used Excel to implement their models, which is staggering when professional tools like R or Matlab are available). Understandably, given the impact their work had the first time round, a storm of recriminations ensued.

I’m a computational biologist, not an economist, so I can’t add to the existing critique of the original study; either its data, methods, analyses or conclusions. What I can offer is  my perspective as a professional natural scientist. First, Rogoff & Reinhart’s desire to disseminate their apparently noteworthy result was entirely understandable. All academics share this excitement at uncovering a new relationship, getting a new model or technique to work, or spotting a unifying theme in previously unrelated pieces of information. Without this science would be totally, utterly, drearily mundane. The novelty of true discovery is what separates school science (boring) from real science (fascinating), trust me.

The worrying thing for me isn’t their desire to publicise their research, or even their apparent failure to check their results (though this would surely have been detected in a thorough review), or even – and I can’t really believe this either – the fact that they apparently fitted these complicated models in Excel (which you really, really shouldn’t ever do, especially when so many alternatives are available – if you don’t believe me, consider Excel’s notorious numerical sloppiness).

Instead, I’m concerned about two effects that amplified the repercussions of a poor piece of research. Firstly, unlike in science, where there is a clear division between theoretical, applied, and industrial research – with established conduits to policymakers – individual economists often seem to operate over a much wider set of contexts. In some cases, there seems to be a carousel of posts in academia, business and government. This may well be one of the attractions of economics (if I could test my theories on community phylogenetics by getting cosy with God and persuading Him to nudge the genetic mutation rate in some entire ecosystems up a bit, I probably would – just out of curiosity). But at best, it seems to blur the distinction between rigorous research, popular economics, and policy-making. At worst it creates a severe moral hazard; a researcher with a strong point of view, picked (and remunerated) as a government advisor, will have less incentive to moderate their view or publish contradictory results. Politicians tend to treat economists (and their ideas) like pets – when in fact the consequences of economic policy are so wide-ranging that a statutory panel of experts might be a better idea. The IPCC might be a useful model (I’m not even going to consider the UK’s politicised and opaque OBR).

Secondly, policy-makers and the public in general do not have a clear idea of the philosophical limitations of economics. The oft-used phrasing which holds that it is ‘an inexact science’ appears to build in some caution; but then the taking of penalty kicks in football is described in similar terms, so that caveat is probably inadequate. There’s also the natural scientists’ commonest objection to counting economics among the sciences: few other phenomena respond to (and interact with) the state of human knowledge about them. That is, gravity works identically whether you are aware of relativity or not. Markets, aware that ‘market confidence’ can affect business, may react badly to new research showing markets overestimate market confidence by losing their confidence – an effect which may now need to be modelled in studies of market confidence… So although parts of economics are quantitative and scientific, the discipline very possibly isn’t, since the phenomenon of human economies includes humans’ study of economics. Gödel would have a field day.

Finally – and this is our fault, not economists’ – I believe that one under-appreciated source of bias in the public’s understanding of economics (though possibly not economists themselves) has to do with numerosity, the collection of cognitive biases we exhibit when dealing with numbers. We’re only just starting to understand how these affect our thought processes, but already it seems clear that we think about numbers very differently to how we think we think about them. I reckon there’s a particular issue with economics – call it a ‘countability bias’: since much of the phenomena under investigation (currency, capital, rates of exchange and interest) are numbers, and quantitative models use numbers, our brains might subconsciously assume that largely descriptive and qualitative models predicting numbers are actually quantitative, even deterministic. Clearly, practicing economists are aware of this distinction. But are the public?

 

The mode and tempo of hepatitis C virus evolution within and among hosts.

BMC Evol Biol. 2011 May 19;11(1):131. [Epub ahead of print]

Gray RR*, Parker J*, Lemey P, Salemi M, Katzourakis A, Pybus OG.

*These authors contributed equally to this article.

BACKGROUND:

Hepatitis C virus (HCV) is a rapidly-evolving RNA virus that establishes chronic infections in humans. Despite the virus’ public health importance and a wealth of sequence data, basic aspects of HCV molecular evolution remain poorly understood. Here we investigate three sets of whole HCV genomes in order to directly compare the evolution of whole HCV genomes at different biological levels: within- and among-hosts. We use a powerful Bayesian inference framework that incorporates both among-lineage rate heterogeneity and phylogenetic uncertainty into estimates of evolutionary parameters.

RESULTS:

Most of the HCV genome evolves at ~0.001 substitutions/site/year, a rate typical of RNA viruses. The antigenically-important E1/E2 genome region evolves particularly quickly, with correspondingly high rates of positive selection, as inferred using two related measures. Crucially, in this region an exceptionally higher rate was observed for within-host evolution compared to among-host evolution. Conversely, higher rates of evolution were seen among-hosts for functionally relevant parts of the NS5A gene. There was also evidence for slightly higher evolutionary rate for HCV subtype 1a compared to subtype 1b.

CONCLUSIONS:

Using new statistical methods and comparable whole genome datasets we have quantified, for the first time, the variation in HCV evolutionary dynamics at different scales of organisation. This confirms that differences in molecular evolution between biological scales are not restricted to HIV and may represent a common feature of chronic RNA viral infection. We conclude that the elevated rate observed in the E1/E2 region during within-host evolution more likely results from the reversion of host-specific adaptations (resulting in slower long-term among-host evolution) than from the preferential transmission of slowly-evolving lineages.