
I’m a programmer – and driverless cars scare the hell out of me

Tech-savvy developer types often ride bikes, and often instinctively back the idea of robot vehicles. After all, the subconscious asks, if I can code a computer to play a game, what’s so hard about getting it to move around a real map?

But driverless cars are not like ‘normal’ AI. They exist in a world with all-too-real consequences. For the first time we’re asking billions of people to cede control of a lethal machine to a still-highly-experimental, incredibly complicated, autonomous system. Noise, edge cases and unusual behaviours not present in training data abound. Autonomous vehicles will be more dangerous for cyclists and pedestrians, not less.

Bikes and people can suddenly reverse, or move sideways, or diagonally. They can even jump! This means that dealing truly safely with these types of users – call it the ‘Friday Night In Peckham Test’ – is a much bigger challenge than handling a motorway, where three lanes of users move in predictable patterns, all at between 40 and 80MPH.

There’s a reason car manufacturers are testing away from towns, and won’t share testing data, or test specifications. They’d rather not have to deal safely with pedestrians and bikes because they know how much more costly that is. So car-makers would rather they went away. They’ll probably get their wish because they’re marquee ‘heavy industry’ employers, and because AI belongs to the sleek shiny optimistic future of tech that all politicians are desperate to court.

Instead we will see two worrying developments to shift responsibility from car makers. There will be pressure to gradually remove troublesome bikes and pedestrians from parts of the roads ‘for their own safety’. We’ve already started to see this.

Alternatively manufacturers will introduce two levels of AI – a truly autonomous ‘safe mode’ which overreacts to all stimuli with infuriatingly (unworkably) slow speeds, or a ‘sport setting’ which makes much riskier decisions to enable faster speeds, under the flimsy caveat that users ‘monitor the system more actively’. Most will prefer to travel faster but few will be bothered to supervise the AI closely or consistently, when Netflix and social media are available distractions.

Finally, a growing body of evidence shows that too much automation of routine tasks can make them more dangerous, not less. Airline pilots’ skills atrophy when autopilot handles most of their flying time – with literally lethal consequences when an emergency occurs. Car drivers – most of whom already take driving far less seriously than such a dangerous activity merits – will suffer the same fate. Who would bet on a driver, suddenly handed control back in an emergency situation, taking the right decision in a split-second if they have perhaps hardly driven for years?

We’d just started to halt the decades-long declines in cycling and walking rates, and lower urban speeds to 20MPH (‘twenty’s plenty’ initiatives), but the rise of autonomous vehicles will likely threaten this and see people walking and cycling relegated to third place, behind fleets of AI cars travelling bumper to bumper at far higher speeds than today and a few die-hard petrolheads defiantly navigating their own vintage tanks among the whizzing fleet.

One of the best things about my teenage years was romping around town on foot or bikes with my mates. My daughter turns 16 in 2030. It’s terrifying to think that by then, the killer robots that end her life might not be gun-toting drones, but plain old delivery vans.

Some aspects of BLASTing long-read data

Quick note to explain some of the differences we’ve observed working with long-read data (MinION, PacBio) for sample ID via BLAST. I’ll publish a proper paper on this, but for now:

  • Long reads aren’t just a bit longer than Illumina data, but two, three, four or possibly even five orders of magnitude longer (up to 10^6 already, vs 10^2). This is mathematically obvious, but extremely important…
  • … the massive length means lots of the yield is in comparatively few reads. This makes yield stats based on numbers of reads positively useless for comparison with NGS. Also…
  • Any given long read contains significantly more information than a short one does. Most obviously the genomics facilities of the world have focused on their potential for improving genome assembly contiguity and repeat spanning (as well as using synteny to spot rearrangements etc) but we’ve also shown (Parker et al, submitted) that whole coding loci can be directly recovered from single reads and used in phylogenomics without assembly and annotation. This makes sense (a ~kb long read can easily span a whole gene, also ~kb in scale) but it certainly wasn’t initially obvious, and given error rates, etc, it’s surprising it actually works.
  • Sample ID using BLAST actually works very differently. In particular, the normal ‘rubbish in, rubbish out’ rule is inverted. In other words, nanopore reads (for the time being) may be long, but inevitably contain errors. However, their length means that, assuming BLAST database sequences are approximately as long/contiguous, nanopore queries tend to either match database targets correctly, with very long alignments (hundreds/thousands of identities), or not at all.

This last point is the most important. What it means is that, for a read, interpreting the match is simple – you’ll either have a very long alignment to a target, or you won’t. Even when a read has regions of identity to more than one species, it has a much longer cumulative alignment length overall to the correct one. This is the main result of our paper.
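As a rough illustration of the cumulative-alignment idea above, here is a minimal Java sketch. It is not the pipeline from the paper: the `Hit` class, the species labels and the alignment lengths are all made up for the example, and in practice you would fill them from your own BLAST output. It simply sums the alignment lengths per species for one read and reports the species with the largest total.

```java
import java.util.HashMap;
import java.util.Map;

public class CumulativeAlignmentId {

    /** One BLAST hit for a single read (hypothetical structure, for illustration only). */
    static final class Hit {
        final String species;
        final int alignmentLength;

        Hit(String species, int alignmentLength) {
            this.species = species;
            this.alignmentLength = alignmentLength;
        }
    }

    /** Sum the alignment lengths per species across all of one read's hits. */
    static Map<String, Integer> cumulativeLengths(Hit[] hits) {
        Map<String, Integer> totals = new HashMap<>();
        for (Hit h : hits) {
            totals.merge(h.species, h.alignmentLength, Integer::sum);
        }
        return totals;
    }

    /** Species with the longest cumulative alignment, or null if the read had no hits. */
    static String bestSpecies(Hit[] hits) {
        String best = null;
        int bestTotal = -1;
        for (Map.Entry<String, Integer> e : cumulativeLengths(hits).entrySet()) {
            if (e.getValue() > bestTotal) {
                best = e.getKey();
                bestTotal = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented numbers: one read with hits to two species
        Hit[] hitsForOneRead = {
            new Hit("species_A", 1800),
            new Hit("species_B", 240),
            new Hit("species_A", 950),
            new Hit("species_B", 310)
        };
        System.out.println(cumulativeLengths(hitsForOneRead));
        System.out.println("Best match: " + bestSpecies(hitsForOneRead));
    }
}
```

Ties are not handled here (whichever species the map yields first wins), so a real implementation would probably want a minimum margin between the top two totals before calling an ID.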

The second implication is that, as it has been put to me, for nanopore reads to be any good for an ID, you have to have a genomic database. While this is true in the narrow sense, our current work (and again, this is partly in our paper, and partly in preparation) shows that in fact all that matters is for the length distribution in the reference database to be similar to the query nanopore one. In particular, we’ve demonstrated that a rapid nanopore sequencing run, with no assembly, can itself serve as a perfectly good reference for future sample ID. This has important implications for sample ID but as I said, more on that later 😉

Step-by-step: Raspberry Pi as a wifi bridge, plus a (really) low-spec media centre…

I’ll keep this brief, really so, because this is mainly an aide-memoire for when this horrific bodge breaks in the next, ooh, month or so. But, for context:

The problem:

Our office/studios are in a shed at the bottom of the garden (~15m). Wifi / wireless LAN just reaches, intermittently.

The solution:

Set up an ethernet network in the shed itself, and connect (‘bridge’) that network to the house wifi with a Raspberry Pi.


You will need:

1x Raspberry Pi (Pi 2 Model B; mine overclocked to ~1150MHz) plus SD card and reader; an old ethernet switch and cables; quite a lot of patience.

A bit more detail:

This step-by-step is going to be a bit arse-about-face, in that the order of the steps you’d actually need from scratch is completely different from the max-frustration, highly circuitous route I actually followed. Not least because I already had a Pi with Ubuntu on:

  1. Get a Pi with Ubuntu on it. This will be acting as the wireless bridge to connect the LAN to the wifi; and also serve IP addresses to other hosts on the LAN (network buffs: yes, I realise this is a crap solution). This is the second-easiest step by a mile; see: this guide for MATE and follow it. We’ll set the Pi up to run without a monitor or keyboard (‘headless’ – connecting over SSH) later, but for now don’t ruin your relationship unduly, do this bit the easy way with a monitor attached.
  3. apt-get update the Pi a few times. You’ll thank yourself later.
  4. Set the Pi up to act as a wifi <–> LAN bridge. There are a lot of tutorials suggesting various ways to achieve this such as this, this, and all of this noise. But ignore them all – with the newest Ubuntu LTS (16.04 at time of writing) this is now far, far, far easier to do in the GUI, and more stable. Just follow this guide.
  5. Set up some other housekeeping tasks for headless login: enable SSH (see also here); set the clock to attempt to update the system time on boot if a time server’s available (make sure to add e.g. server to your /etc/ntp.conf file); and log in to the desktop automatically. This last action isn’t necessary, and purists will claim it wastes resources, but this is a Pi 2 and we’re only serving DHCP on it, basically – it can afford that. The reason I’ve enabled this is because it seems to force the WLAN adapter to try to acquire the home wifi a bit more persistently (see below). I’ve tried to achieve the same results using wpa_supplicant, but with no stability and my time is a pretty finite resource, so screw it – I’m a scientist, not an engineer!
  6. Lastly, I’ve made some fairly heavy-duty edits (not following but at least guided by this and this) to my /etc/network/interfaces file, with a LOT of trial and error which included a couple of false starts bricking my Pi (if that happens to you, reinstall Ubuntu. Sorry.) It now reads (home wifi credentials redacted):
    # interfaces(5) file used by ifup(8) and ifdown(8)
    # Include files from /etc/network/interfaces.d:
    source-directory /etc/network/interfaces.d

    # The loopback network interface
    auto lo
    iface lo inet loopback

    # LOOK at all the crap I tried...
    #allow-hotplug wlan0
    #iface eth0 inet dhcp
    #allow-hotplug wlan0
    #iface wlan0 inet manual
    #iface eth0 inet auto
    #wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf
    # Yep, that lot took a while :\

    # Finally, this worked:
    auto wlan0
    iface wlan0 inet dhcp
    wpa-ssid "q***********"
    wpa-psk "a******"
    # That's it :D
  7. Connect the Pi to your other computers using the switch and miles of dodgy ethernet cabling.
  8. Disconnect the screen, reboot, and wait for a long time – potentially hours – for the Pi to acquire the wifi. You should now be able to a) ping and/or login to the Pi from other hosts on the LAN, and b) ping/access hosts on the home WLAN, and indeed the wider Internet if your WLAN has a connection(!)

A Media centre from scratch

Last of all, having gone to all that trouble, the glaring bandwidth inadequacies of our crap WLAN showed up. Being stingy by nature (well, and because the phone companies in our area insist that, despite our living less than a day’s march from Westminster, their exchanges have run out of fibre capacity for 21st-century broadband) I decided to mitigate this for the long winter months the simplest way: gather the zillions of mp3s, ripped DVDs and videos from all our devices onto one server. I put Ubuntu (the same 16.04 / MATE distribution as on the Pi, in fact) onto an old Z77 motherboard my little brother had no use for, in an ancient (~2003) ATX case, with a rock-bottom new Celeron CPU (~£25) plus 4GB SDRAM and a cheap spinning drive I had lying about (a 2TB Toshiba SATA, IIRC). This is highly bodgy. So much so, in fact, that the CPU fan is cable-tied onto the mobo, because the holes for the anchor pins didn’t line up. But: it works, and only has to decode/serve MP3s and videos, after all.

I apt-get updated that a few times, plus adding in some extra packages like htop, openssh, and hardinfo – plus removing crap like games and office stuff – to make it run about as leanly as possible. Then, to manage and serve media I installed something I’d wanted to play with for a while: Kodi. This is both a media manager/player (like iTunes, VLC, or Windows Media Player) and also streaming media server, so other hosts on my LAN can access the library by streaming over the ethernet if they want without replicating files.

Setting up Kodi was simplicity itself, as was adding movies and music to the library, but one minor glitch I encountered was reading/displaying track info and artwork, which usually happens on iTunes pretty seamlessly via ID3 tags, fingerprinting, and/or Gracenote CDDB querying. Turns out I’d been spoilt this last decade, because in Kodi this doesn’t happen so magically. Instead, you need to use a tool like MusicBrainz Picard to add the tags to MP3s on your system, then re-scan them into Kodi for the metadata to be visible. The re-scanning bit isn’t as onerous as you’d think – files are left in place, the ID3 tags being used simply to update Kodi’s metadata server (I guess) – but the initial Picard search for thousands of MP3s over a slow WLAN took me most of a night.

However. A small price to pay to actually have music to listen to while I work away writing crap like this in the shed, or shoddy-quality old episodes of Blackadder or Futurama to watch in the evening :p

Copying LOADS of files from a folder of LOADS *AND* LOADS more in OSX

Quick one this, as it’s a tricky problem I keep having to Google/SO. So I’m posting here for others but mainly myself too!

Here’s the situation: you have a folder (with, ooh, let’s say 140,000 separate MinION reads, for instance…) which contains a subset of files you want to move or copy somewhere else. Normally, you’d do something simple like a wildcarded ‘cp’ command, e.g.:

host:joeparker$ cp dir/*files_I_want* some/other_dir

Unfortunately, if the list of files matched by that wildcard is sufficiently long (more than a few thousand), you’ll get an error like this:

-bash: /bin/cp: Argument list too long

In other words, you’re going to have to be more clever. Using the GUI/Finder usually isn’t an option either at this point, as the directory size will likely defeat Finder, too. The solution is pretty simple but takes a bit of tweaking to work in OSX (full credit to posts here and here that got me started).

Basically, we’re going to use the ‘find’ command to locate the files we want, then pass each one in turn as an argument to ‘cp’ using ‘find -exec’. This is a bit slower overall than doing the equivalent as our original wildcarded command but since that won’t work we’ll have to lump it! The command* is:

find dir -name '*files_I_want*' -maxdepth 1 -exec cp {} some/other_dir \;

Simple, eh? Enjoy 🙂

*NB, In this command:

  • dir is the filesystem path to start the search; ‘find’ will recursively traverse the directory tree including and below this folder;
  • -name '*glob*' gives the file name pattern to match (quote it, so the shell doesn’t expand the wildcard before ‘find’ sees it);
  • -maxdepth is how deep to recurse (e.g. 1 = ‘don’t recurse’);
  • cp is the command we’re executing on each found file (could be mv etc);
  • {} is the placeholder which ‘find’ replaces with the path of each found file;
  • some/other_dir is the destination argument to the command invoked by -exec
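If you would rather do the same job from Java than from the shell, the sketch below is one way to go about it. It is only an illustration, not part of the original recipe: the `GlobCopy` class and `copyMatching` method are names invented here. It relies on `Files.newDirectoryStream(dir, glob)`, which streams matching entries one at a time without recursing into subdirectories, so there is no argument list to overflow.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class GlobCopy {

    /**
     * Copy every regular file directly inside sourceDir whose name matches
     * the glob into destDir. Returns the number of files copied.
     */
    static int copyMatching(Path sourceDir, String glob, Path destDir) throws IOException {
        Files.createDirectories(destDir);
        int copied = 0;
        // The stream is lazy, so a directory of 140,000 files is not a problem
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(sourceDir, glob)) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    // Overwrite existing files, as a plain 'cp' would
                    Files.copy(file, destDir.resolve(file.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                    copied++;
                }
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("Usage: java GlobCopy <dir> <glob> <dest_dir>");
            return;
        }
        int n = copyMatching(Paths.get(args[0]), args[1], Paths.get(args[2]));
        System.out.println("Copied " + n + " files");
    }
}
```

When calling it from a shell, the glob still needs quoting (e.g. `java GlobCopy dir '*files_I_want*' some/other_dir`), for the same reason as with ‘find’.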

How to fake an OSX theme appearance in Linux Ubuntu MATE

I’ve recently been fiddling about and trying to fake an OSX-style GUI appearance in Linux Ubuntu MATE (15.04). This is partly because I prefer the OSX GUI (let’s be honest) and partly because most of my colleagues are also Mac users mainly (bioinformaticians…) and students in particular fear change! The Mac-style appearance seems to calm people down. A bit.

The specific OS I’m going for is 10.9 Mavericks, because it’s my current favourite and nice and clear. There are two main things to set up: the OS itself and the appearance. Let’s take them in turn.

1. The OS

I’ve picked Ubuntu (why on Earth wouldn’t you?!) and specifically the MATE distribution. This has a lot of nice features that make it fairly Mac-y, and in particular the windowing and package management seem smoother to me than the vanilla Ubuntu Unity system. Get it here:* The installation is pretty painless on Mac, PC or an existing Linux system. If in doubt you can use a USB as the boot volume without affecting existing files; with a large enough partition (the core OS is about 1GB) you can save settings – including the customisation we’re about to apply!

*We’re installing the 15.04 version, not the newest release, as 15.04 is an LTS (long-term stable) Ubuntu distribution. This means it’s supported officially for a good few years yet. [Edit: Arkash (see below) kindly pointed out that 14.04 is the most recent LTS, not 15.04. My only real reason for using 15.04 therefore is ‘I quite like it and most of the bugs have gone'(!)]

2. The appearance

The MATE windowing system is very slick, but the green-ness is a bit, well, icky. We’re going to download a few appearance mods (themes, in Ubuntu parlance) which will improve things a bit. You’ll need to download these to your boot/install USB:

Boot the OS

Now that we’ve got everything we need, let’s boot up the OS. Insert the USB stick into your Mac-envious victim of choice, power it up and enter the BIOS menu (F12 in most cases) before the existing OS loads. Select the USB drive as the boot volume and continue.

Once the Ubuntu MATE session loads, you’ll have the option of trialling the OS from the live USB, or permanently installing it to a hard drive. For this computer I won’t be installing to a hard drive (long story) but using the USB, so customising that. Pick either option, but beware that customisations to the live USB OS will be lost should you later choose to install to a hard drive.

When you’re logged in, it’s time to smarten this baby up! First we’ll play with the dock a bit. From the top menu bar, select “System > Preferences > MATE Tweak” to open the windowing management tool. In the ‘Interface’ menu, change Panel Layouts to ‘Eleven’ and Icon Size to ‘Small’. In the ‘Windows’ menu, we’ll change Buttons Layout to ‘Contemporary (Left)’. Close the MATE Tweak window to save. This is already looking more Mac-y, with a dock area at the bottom of the screen, although the colours and icons are off.

Now we’ll apply some theme magic to fix that. Select “System > Preferences > Look and Feel > Appearance”. Now we can customise the appearance. Firstly, we’ll load both the ‘Ultra-Flat Yosemite Light’ and ‘OSX-MATE’ themes, so they’re available to our hybrid theme. Click the ‘Install..’ icon at the bottom of the theme selector, and you’ll be able to select and install the Ultra-Flat Yosemite Light theme we downloaded above. It should unpack from the .zip archive and appear in the themes panel. Installing the OSX-MATE theme is slightly trickier:

  • Unzip (as sudo) the OSX-MATE theme to /usr/share/themes
  • Rename it from OSX-MATE-master to OSX-MATE if you downloaded it from git as a whole repository (again, you’ll need to sudo)
  • Restart the appearances panel and it should now appear in the themes panel.

We’ll create a new custom theme with the best bits from both themes, so click ‘Custom’ theme, then ‘Customise..’ to make a new one. Before you go any further, save it under a new name! Now we’ll apply changes to this theme. There are five customisations we can apply: Controls, Colours, Window Border, Icons and Pointer:

  • Controls: Ultra-Flat Yosemite Light
  • Colours: There are eight colours to set here. Click each colour box then in the ‘Colour name’ field, enter:
    • Windows (foreground): #F0EAE7 / (background): #0F0F0E
    • Input boxes (fg): #FFFFFF / (bg): #0F0F0E
    • Selected items (fg): #003BFF / (bg): #F9F9F9
    • Tooltips: (fg): #2D2D2D / (bg): #DEDEDE
  • Window border: OSX-MATE
  • Icons: Fog
  • Pointer: DMZ (Black)

Save the theme again, and we’re done! Exit Appearance Preferences.

Finally we’ll install Solarized as the default terminal (command-line interface) theme, because I like it. In the MATE Terminal, unzip the solarized-mate-terminal archive, as sudo. Enter the directory and simply run (as sudo) the install script using bash:

$ sudo unzip solarized-mate-terminal
$ cd solarized-mate-terminal
$ bash

Close and restart the terminal. Hey presto! You should now be able to see the light/dark Solarized themes available, under ‘Edit > Profiles’. You’ll want to set one as the default when opening a new terminal.


Later, I also installed Topmenu, a launchpad applet that gives an OSX-style top-anchored application menu to some linux programs. It’s a bit cranky and fiddly though, so you might want to give it a miss. But if you have time on your hands and really need that Cupertino flash, be my guest. I hope you’ve had a relatively easy install for the rest of this post, and if you’ve got any improvements, please let me know!

Happy Tweaking…

BaTS (and Befi-BaTS), SHiAT, and Genome Convergence Pipeline have moved!

Important – please take note!

  • All my phylogenetics software is now on GitHub, not websites or Google Code
  • Please use the new FAQ pages and issue/bug tracker forms, rather than emailing me directly in the first instance

Until now, I’ve been hosting the open-sourced parts of my phylogenetics software on Google Code. These include the BaTS (and Befi-BaTS) tools for phylogeny-trait association correlations; the alignment profilers SHiAT (and Geneious Entropy plugin); and the Genome Convergence API for the Genome Convergence Pipeline and Phylogenomics Dataset Browser. However, Google announced that they are ending support for Google Code, and from August all projects will be read-only.

I’ve therefore migrated all my projects to GitHub. This will eventually include FAQs, forums and issue/bug tracking for the most popular software, BaTS and Genome Convergence API.

The projects can now be found at:


I am also changing how I respond to questions and bug requests. In the past I dealt with questions as they came in, with the odd explanatory post and a manual or readme with each release. Predictably, this meant I spent a lot of time dealing with duplicates or missing bugs or feature requests. I am now in the process of compiling a list of FAQs for each project, as well as uploading the manuals in markdown format so that I can update them with each release. Please bear with me as I go through this process. In the meantime, if you have an issue with a piece of software or think you have found a bug, please:

  1. Make sure you have the most recent version of the software. In most cases this will be available as an executable .jar file on the project github page.
  2. Check the ‘Issues’ tab on the project github page. Your issue may be a duplicate, or already fixed by a new release. If your bug isn’t listed, please open a new issue giving as much detail as possible.
  3. Check the manual and FAQs to see if anyone else has had the same problem – I may well have answered their question already.
  4. If you still need an answer please email me on

Thanks so much for your support and involvement,


Embedding Artist profiles, playlists, and content from Spotify in HTML

Quick post this – turns out Spotify have added a really cool new function to their desktop application: You can now right-click any resource in Spotify (could be an artist, a playlist, a profile or a track or album) and get a link to the HTML code you need to embed it into another webpage. The link looks like this:


The HTML is then copied to your clipboard, ready to drop into an artist webpage. Pretty cool eh? Let’s give it a spin:

<iframe src="" height="300" width="300" frameborder="0"></iframe>

Parsing numbers from multiple formats in Java

We were having a chat over coffee today and a question arose about merging data from multiple databases. At first sight this seems pretty easy, especially if you’re working with relational databases that have unique IDs (like, uh, a Latin binomial name – Homo sapiens) to hang from… right?

But, oh no.. not at all. One important reason is that seemingly similar data fields can be extremely tricky to merge. They may have been stated with differing precision (0.01, 0.0101, or 0.01010199999?), be encoded in different data types (text, float, blob, hex etc) or character set encodings (UTF-8 or Korean?) and even after all that, refer to subtly different quantities (mass vs weight perhaps). Who knew database ninjas actually earnt all that pay.

So it was surprising, but understandable, to learn that a major private big-data user (unnamed here) stores pretty much everything as text strings. Of course this solves one set of problems nicely (everyone knows how to parse/handle text, surely?) but creates another. That’s because it is trivially easy to code the same real-valued number in multiple different text strings – some of which may break sort algorithms, or even memory constraints. Consider the number ‘0.01’: as written there’s very little ambiguity for you and me. But what about:

“ 0.01” (note the space),
or even “0.01000000000”?

After a quick straw poll, we also realised that none of us knew for certain how our most-used programming languages (Java for me; Perl, Python etc for others) performed the appropriate conversion in their native string-to-float methods. We knew how we thought they worked, and how we hoped they would, but it’s always worth checking. Time to write some quick code – here it is, on GitHub.

And in code:


/**
 * Class to test the Float.parseFloat() method performance on text data.
 * In particular odd strings which should be equal, e.g.
 * <ul>
 *     <li>" 0.01" (note space)</li>
 * </ul>
 * NB uses assertions to test - run JVM with '-ea' argument. The first three tests should pass in the orthodox manner. The fourth should throw assertion errors to pass.
 * @author joeparker
 */
public class TextToFloatParsingTest {

    /**
     * Default no-arg constructor
     */
    public TextToFloatParsingTest(){
        /* Set up the floats as strings*/
        String[] floatsToConvert = {"0.01","00.01"," 0.01","0.0100"};
        Float[] floatObjects = new Float[4];
        float[] floatPrimitives = new float[4];

        /* Convert the floats, first to Float objects and also cast to float primitives */
        for(int i=0;i<4;i++){
            floatObjects[i] = Float.parseFloat(floatsToConvert[i]);
            floatPrimitives[i] = floatObjects[i];
        }

        /* Are they all equal? They should be: test this. Should PASS */
        /* Iterate through the triangle */
        System.out.println("Testing conversions: test 1/4 (should pass)...");
        for(int i=0;i<4;i++){
            for(int j=1;j<4;j++){
                assert(floatPrimitives[i] == floatPrimitives[j]);
                assert(floatObjects[i] == floatPrimitives[j]);
            }
        }
        System.out.println("Test 1/4 passed OK");

        /* Test the numerical equivalent */
        System.out.println("Testing conversions: test 2/4 (should pass)...");
        for(int i=0;i<4;i++){
            assert(floatPrimitives[i] == 0.01f);
        }
        System.out.println("Test 2/4 passed OK");

        /* Test the numerical equivalent inequality. Should PASS */
        System.out.println("Testing conversions: test 3/4 (should pass)...");
        for(int i=0;i<4;i++){
            assert(floatPrimitives[i] != 0.02f);
        }
        System.out.println("Test 3/4 passed OK");

        /* Test the inversion */
        /* These assertions should FAIL*/
        System.out.println("Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...");
        boolean test_4_pass_flag = false;
        try{
            for(int i=0;i<4;i++){
                for(int j=1;j<4;j++){
                    assert(floatPrimitives[i] != floatPrimitives[j]);
                    assert(floatObjects[i] != floatPrimitives[j]);
                    test_4_pass_flag = true; // If AssertionErrors are thrown as we expect they will be, this is never reached.
                }
            }
        }finally{
            // test_4_pass_flag should never be set true (above) if AssertionErrors have been thrown correctly.
            // The AssertionError itself is not caught here, so it still propagates out of main().
            if(test_4_pass_flag){
                System.err.println("Test 4/4 passed! This constitutes a logical FAILURE");
            }else{
                System.out.println("Test 4/4 passed OK (expected assertion errors occured as planned.");
            }
        }
    }

    public static void main(String[] args) {
        new TextToFloatParsingTest();
    }
}


If you run this with assertions enabled (‘/usr/bin/java -ea package’) you should get something like:

Testing conversions: test 1/4 (should pass)...
Test 1/4 passed OK
Testing conversions: test 2/4 (should pass)...
Test 2/4 passed OK
Testing conversions: test 3/4 (should pass)...
Test 3/4 passed OK
Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...
Exception in thread "main" java.lang.AssertionError
Test 4/4 passed OK (expected assertion errors occured as planned.
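Since all of those strings parse to the same float, one way to stop text-encoded numbers tripping up sorts and comparisons is to reduce each to a single canonical string before storing it. The sketch below is just one possible approach, not something from the test class above: `TextNumberNormaliser` and `canonical` are hypothetical names, and it uses `java.math.BigDecimal` rather than `Float` so that no precision is lost along the way.

```java
import java.math.BigDecimal;

public class TextNumberNormaliser {

    /** Hypothetical helper: reduce a decimal string to one canonical text form. */
    static String canonical(String raw) {
        // Unlike Float.parseFloat(), the BigDecimal(String) constructor
        // rejects surrounding whitespace, so trim it off first
        BigDecimal value = new BigDecimal(raw.trim());
        // Drop trailing zeros, then print without an exponent
        return value.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        String[] samples = {"0.01", "00.01", " 0.01", "0.0100", "0.01000000000"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> \"" + canonical(s) + "\"");
        }
    }
}
```

Every sample string here should come out as the same text, 0.01. Whether trimming and zero-stripping is acceptable depends on the data: if trailing zeros are meant to record measurement precision, this would throw that information away.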

HYPHY Hack: Passing arguments to HYPHY for phylogenetics using the command-line

[ Important update, 2017-Feb-07 ]

[ This solution, already a bit hacky, should now be considered a last resort. Sergei and colleague Stephen Weaver have suggested a much more elegant solution. You’ll still have to dive into the batch file you want to iterate over (to work out what user options are presented, in which order) but you should not have to edit the batch files themselves directly. The solution below may no longer work for some versions of HyPhy, owing to altered fscanf() behaviour. ]

HYPHY is a great platform for advanced phylogenetics by Sergei L. Kosakovsky Pond, Simon D. W. Frost and Spencer V. Muse, where abstract concepts such as likelihood-ratio tests, model selection, and phylogenetic inference are represented and manipulated by means of a powerful and flexible object-oriented language called Hyphy Batch Language, or HBL, using workflows known as ‘batch files’ (actually more like routines). A large number of publications to date (around a thousand) have made use of HYPHY, which includes additional features such as a GUI and ready-to-use implementations of advanced published methods. It also powers the online phylogenetics server.

However, for all this flexibility, HYPHY actually has an ugly side: Because the batch file system is so central to operations, there isn’t a convenient way to pass arguments to HYPHY via the command-line. Yes, there are plenty of ways to get data into HYPHY at or before runtime (hard-coded options; reading in config files; dialog prompts on the command-line or GUI), but none that correspond to a standard POSIX-style program argument. In a phylogenomics context this caused our group some problems…

The problem

Let’s suppose we have a set of loci (perhaps a few thousand), with different names. An earlier pipeline has produced a set of subdirectories, one for each locus, with an alignment file and a phylogenetic tree in each. Say we want to run the same positive selection test (I’ll assume the branch-site random-effects likelihood test for this post, implemented already in HYPHY as the batch file) on each in HYPHY – how can we do that? We have a few options:

  1. Run HYPHY in GUI mode: This has the advantage of being easy to do. But it’s incredibly demanding of human input – who’s going to sit and click through thousands of HYPHY sessions? This input will also make it slower (depending on the analysis, the human component might be the limiting step); and it will certainly introduce the potential for human errors.
  2. Create a custom HYPHY batch file, and rename the input files in each locus: In other words, a script which looks for input files named something like ‘input.fasta‘ and ‘input.tre‘, and executes them. Unfortunately, there’s a risk we might over-write files we don’t want to, if one or more HYPHY calls fail part-way through. It could also be hard to parallelise this.
  3. Create a custom HYPHY batch file to loop through the input directories: This is how we probably ought to do things natively in the ‘HYPHY way’ – HBL is powerful enough to let us do things like read directory contents, split and test and generally manipulate strings etc. So we could probably work out how to write a wrapper batch file in HBL for HYPHY that would call the analysis batch file. But do we really want to delve deeply into yet another language just to do that? And suppose we wanted to run the same analysis on another set of data in a month or so – we’d have to edit the wrapper file to loop through a different directory…
  4. What we really want to do is pass arguments to HYPHY using the command-line: That is, we want to be able to use the STDIN standard input stream to pass the input alignment and phylogeny files’ paths to HYPHY, read them into the batch file as variables, and execute the batch file with no further input. This method will be flexible – we can use any paths we want, and change them at any time – and modular because we won’t have lots of different files sitting about for analyses at different times, just one.

It turns out that it’s actually pretty easy to do this – it took me an hour or so to work it out, and a couple more for implementation and testing – and with this guide you should be able to do it far quicker. There are several steps:

  1. Refactor the existing batch file to collect key variables
  2. Edit batch file to read variables from STDIN
  3. Call HYPHY in command-line mode, passing variables in-place as a ‘here’ string

That’s it! Here are the steps in detail:

1. Refactor the existing batch file to collect key variables

(NB: links to my hacked copies further down this page)

If you’re not familiar with HYPHY (and if you were, you probably wouldn’t be interested in this hack), this will be the intimidating bit. But relax: if you know C, Java, Perl, or any modernish procedural language, this is easy.

What we want to do is take the existing standard analysis batch file which came with HYPHY, and work out all the places where HYPHY expects user input. We’ll need to either hardcode those, or pass variables from the command-line. To make this less likely to break, we’re going to a) work on a renamed copy of the batch file, and b) refactor the code so all those variables are initialised right at the start of the batch file, where we can see them.

To start with, run the batch file in GUI mode as normal. This lets you check the input files are actually formatted correctly. Also note down all the points where the script asks for input, and what you want those inputs to be. In the REL test, the steps are: pick genetic code (‘universal’); input alignment (‘hyphy-input.fasta’); input phylogeny (‘hyphy-input.tre’); and output file (‘hyphy-output.REL’ but really, output file prefix – there’s several outputs in fact, which will share this prefix). Now we can go to the head of the copied file, and set these variables up. To start with, we’ll hardcode them. Later, we’ll read them from the command line via standard input. I’ve used ALL_CAPS variables for readability, not that HBL cares:

/* Variables we'll define and later set by STDIN */
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_GENETIC_CODE = 1;
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";

/* Start of normal batch file */
skipCodeSelectionStep = 0;


So the four variables we’ve introduced are: JOE_HARDCODE_ALIGNMENT; JOE_HARDCODE_PHYLOGENY; JOE_HARDCODE_GENETIC_CODE; and JOE_HARDCODE_OUTPUT. We’ve defined these, but they’re not actually used anywhere yet – as things stand, HYPHY will still try and ask the user for input. What we need to do instead is go through the batch file looking for methods that prompt the user for input, and replace them with our variables instead. From a quick read of the HBL documentation (nb, the HTML documentation that comes with HYPHY is more useful), there seem to be two main ways HYPHY gets user input. They are:

/* fscanf() - reads input to a variable, e.g. from console (command-line) to a string, as here (userInput is just an illustrative name): */
fscanf (stdin, "String", userInput);
/* PROMPT_FOR_FILE, a special variable that opens a system dialog/file chooser, as here: */
DataSet ds = ReadDataFile(PROMPT_FOR_FILE);

All we need to do is look through the batch files for the places where the user interactions we noted in our GUI session happened, and replace each fscanf() or PROMPT_FOR_FILE with our variables. Then when we change the variables from being hardcoded to being passed as arguments at the command-prompt, we’ll have our complete program. In the case of this batch file, there are in fact a number of included scripts (additional batch files or model definition files) used in the analysis – so in some cases we need to change those too. Make sure to use copies and rename them…
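
A quick way to hunt for those prompts is to search the batch file copies from the shell. This is only a rough sketch: it assumes the three patterns mentioned in this post (fscanf, PROMPT_FOR_FILE and ChoiceList) are the ones worth looking for, and demo.bf is a hypothetical stand-in file written here purely so the example runs by itself:

```shell
# Write a tiny stand-in batch file (hypothetical content, for illustration only).
demo_dir=$(mktemp -d)
cat > "$demo_dir/demo.bf" <<'EOF'
skipCodeSelectionStep = 0;
DataSet ds = ReadDataFile(PROMPT_FOR_FILE);
fscanf (PROMPT_FOR_FILE, REWIND, "Raw", treeString);
ChoiceList (modelType,"Choose Genetic Code",1,SKIP_NONE,_geneticCodeOptionMatrix);
EOF

# List every line (with its line number) that may ask the user for input.
prompt_lines=$(grep -n -E 'PROMPT_FOR_FILE|fscanf|ChoiceList' "$demo_dir/demo.bf")
printf '%s\n' "$prompt_lines"
```

Point the same grep at your renamed copies (and at any libraries they load) to get a checklist of lines to edit.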

The datafile (alignment)
This is found in the main batch file, as above. This line is easy to find and change:

DataSet ds = ReadDataFile(PROMPT_FOR_FILE);
/* Change PROMPT_FOR_FILE to our initialised JOE_HARDCODE_ALIGNMENT.

Make sure to _replace_ 'PROMPT_FOR_FILE',
or comment out the old line if you want to keep a copy of it! */

DataSet ds = ReadDataFile(JOE_HARDCODE_ALIGNMENT);

The output files’ prefix
This is found in the main batch file, as above. Also easy, although PROMPT_FOR_FILE is used in an odd context:

SetDialogPrompt ("Save analysis results to");
fprintf (PROMPT_FOR_FILE, CLEAR_FILE, KEEP_OPEN,"Branch,Mean_dNdS,Omega1,P1,Omega2,P2,Omega3,P3,LRT,p,p_Holm");
/* Again, replace PROMPT_FOR_FILE with our output variable: */
fprintf (JOE_HARDCODE_OUTPUT, CLEAR_FILE, KEEP_OPEN,"Branch,Mean_dNdS,Omega1,P1,Omega2,P2,Omega3,P3,LRT,p,p_Holm");

The tree (phylogeny)
Annoyingly, this is found in a required batch file, not the main one: the queryTree function library. So we need to locate this file, copy and rename it, edit the copy, and also edit the place where it is called so that our hacked version is called instead. The library itself is found in the TemplateBatchFiles directory, and I saved my copy alongside it. Within the copy, the relevant line has a similar syntax to the output file:

SetDialogPrompt ("Please select a tree file for the data:");

fscanf (PROMPT_FOR_FILE, REWIND, "Raw", treeString);
/* As before, replace PROMPT_FOR_FILE
with our phylogeny variable. In my case: */

fscanf (JOE_HARDCODE_PHYLOGENY, REWIND, "Raw", treeString);

Because this is an external function library, we need to find where in the main batch file it’s imported, and make sure our hacked copy is loaded instead. We need to change:

LoadFunctionLibrary ("queryTree");
/* Replace with our hacked copy
(the *.bf suffix isn't needed) */

LoadFunctionLibrary ("queryTree_hardcode");

The genetic code translation definitions
The genetic code translation type is also handled in an external library, chooseGeneticCode.def, but annoyingly, this isn’t in TemplateBatchFiles, but a TemplateBatchFiles/TemplateModels subdirectory. Such is life… again, I’ve worked on a copy, chooseGeneticCode_HardcodeUniversal.def, and after modifying the library itself we need to edit the library call to make sure our hacked version is pulled in. First, the edit, which uses a slightly different, but still intuitive syntax, found at chooseGeneticCode.def:95:

if (!skipCodeSelectionStep)
/* this is where the user input routine ChoiceList() is called... */
ChoiceList (modelType,"Choose Genetic Code",1,SKIP_NONE,_geneticCodeOptionMatrix);

if (modelType < 0)
/* but this is where the variable is actually set... */
ApplyGeneticCodeTable (modelType);
/* ... so we'll replace modelType with our global JOE_HARDCODE_GENETIC_CODE variable: */
ApplyGeneticCodeTable (JOE_HARDCODE_GENETIC_CODE);

The corresponding call to TemplateModels.chooseGeneticCode.def in the main batch file is right back at line 2:

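skipCodeSelectionStep = 0;
LoadFunctionLibrary ("chooseGeneticCode");
/* Replace the default library with our hacked one -
Note that the subdirectory path isn't needed; the TemplateModels subdirectory is searched by default */
LoadFunctionLibrary ("chooseGeneticCode_HardcodeUniversal");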



2. Edit batch file to read variables from STDIN

Phew! Good news is that was the fiddly bit; the rest of this is all easy. The next step is to replace the hardcoded variable initialisations at the head of our copy with fscanf() calls that will assign values to these variables from the standard input (command-line). So we’ll comment out:

/* Variables we'll define and later set by STDIN */
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";
/* Start of normal batch file */
skipCodeSelectionStep = 0;

And replace them with:

/* Variables we'll define and later set by STDIN */
/* comment out the hardcoded definitions ...
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";

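And replace with stdin read via fscanf(): */
fscanf (stdin, "String", JOE_HARDCODE_ALIGNMENT);
fscanf (stdin, "String", JOE_HARDCODE_PHYLOGENY);
fscanf (stdin, "String", JOE_HARDCODE_OUTPUT);
JOE_HARDCODE_GENETIC_CODE = 1; // OK, we'll keep this one hardcoded for now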
/* Start of normal batch file */
skipCodeSelectionStep = 0;

These are pretty self-explanatory. Done!

3. Call HYPHY in command-line mode, passing variables in-place as a ‘here’ string

At this point, all we’ve really done is refactor the batch file. We’ve moved where the variables are initialised / set, so that we can find them easily, and we’ve called fscanf() on each of them in order to set them. So far, because fscanf() implies someone, somehow, will need to type stuff into stdin at a prompt, this doesn’t actually solve our main problem – how to pass variables on the command line to HYPHY – but what it has done is made everything a lot neater. Note that these are still three separate calls to fscanf(), however – which means HYPHY will expect three discrete chunks of user interaction. In a nutshell, if we ran HYPHY now, we’d get something like:

>HYPHY: Please choose a data file:
me: /foo/bar/hyphy_input.fa

>HYPHY: Please select a tree:
me: /foo/bar/hyphy_input.tre

>HYPHY: Please choose a file for output:
me: /foo/bar/hyphy_output.REL

So we need to get bash to accept input from a file or command-line, and pass it onto HYPHY each time HYPHY wants input. The easy way to do this is to put each user response on a separate line in a file, and use the ‘<’ operator to redirect the standard input stream to this file, instead of the keyboard. This might look a bit like:

# in a plain-text input file:
hyphy-input.fasta # the alignment
hyphy-input.tre # the tree
hyphy-output.REL #the output

# HYPHYMP (the binary) could then be called with:
$user~: HYPHYMP < /path/to/input/file
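
To see the redirect in action without HYPHY installed, here is a small sketch. The fake_hyphy function is a hypothetical stand-in for the real binary: it just reads three lines from standard input, roughly as I'd expect the three fscanf() calls to do, and the responses file is a throwaway temp file:

```shell
# Stand-in for HYPHYMP (hypothetical): reads three answers from stdin,
# one per line, and reports what it received.
fake_hyphy() {
    read -r alignment
    read -r tree
    read -r output
    echo "alignment=$alignment tree=$tree output=$output"
}

# One response per line, in the order the prompts appear.
responses=$(mktemp)
printf '%s\n' hyphy-input.fasta hyphy-input.tre hyphy-output.REL > "$responses"

# '<' makes the file, not the keyboard, the source of standard input.
result=$(fake_hyphy < "$responses")
echo "$result"
rm -f "$responses"
```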

But that wouldn’t really help us, would it? We’d have to edit that file separately for each locus! Luckily there is a handy Bash trick which I had to search for a bit – the ‘here’ string (I found this on LinuxJournal). This lets us redirect a string in-place to the command-line, and takes the form:

$user~: command <<<'input_string_to_stdin'

Remembering that we had three fscanf() calls, one for each of our refactored variables, we’ll need three inputs. No problem (StackExchange to the rescue) – we can separate the inputs with newline (‘\n’) characters (we’ll also need the ‘$’ operator, to make sure bash interprets the newlines correctly), like this:

$user~: command <<<$'input_1\ninput_2\ninput_3'

This syntax is equivalent to giving the command ‘command’ three separate and sequential inputs.
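
If you want to convince yourself that the ‘$’ really matters, this little bash check counts how many lines arrive on standard input with and without it. It's a sketch only (count_lines is a helper name I've made up), and it needs bash rather than plain sh, since here strings are a bash feature:

```shell
#!/usr/bin/env bash
# count_lines: hypothetical helper that reports how many lines it was sent on stdin.
count_lines() { wc -l | tr -d ' '; }

# With $'...' bash turns each \n into a real newline: three separate inputs.
with_dollar=$(count_lines <<<$'input_1\ninput_2\ninput_3')

# Without the $ the \n stays as two literal characters: one long input.
without_dollar=$(count_lines <<<'input_1\ninput_2\ninput_3')

echo "with \$: $with_dollar line(s); without: $without_dollar line(s)"
```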

Putting it all together

Finally we’ve got everything we need to run HYPHY in command-line mode. To recap:

  • A command-line friendly version of HYPHY (see this post);
  • The edited versions of the main batch file, chooseGeneticCode.def and the queryTree library, renamed and in place alongside their original copies;
  • Input alignment and tree files, and a writeable output directory;
  • A means (the ‘here’ or ‘<<<‘ operator) of sending multiple-line inputs to the standard input stream.

Running HYPHY on the command line with arguments passed

Let’s do this! There are a couple of additional options (CPU=integer, which sets the number of cores, and BASEPATH=/path/to/batchfiles, which ensures the right batchfile directory is being used) but don’t worry about those for now.

The complete command is:

/usr/local/bin/HYPHYMP CPU=number_of_cpu_cores BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ <<<$'/path/to/hyphy_input.fa\n/path/to/hyphy_input.tre\n/path/to/hyphy_output.REL'

You can optionally use stuff like >log.out and 2>log.err to redirect STDOUT and STDERR if you want; also & to fork and leave running etc. But the critical bit of this command is the last bit, after the ‘<<<‘ handle. I’ve only tested this using absolute/full pathnames for the input/output file arguments – it’s a pain but less likely to break in the short-term (what happens if you move the whole project folder is another matter…)
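
Since absolute paths are the safe option, it can help to let the shell build them from whatever relative path you have to hand. A minimal sketch, assuming only standard tools (cd, pwd, dirname, basename); to_abs is a hypothetical helper name, not anything HYPHY provides:

```shell
# to_abs: print an absolute path for a path whose directory already exists
# (hypothetical helper, not part of HYPHY).
to_abs() {
    # cd into the directory part inside a subshell, then ask the shell where we are
    ( cd "$(dirname "$1")" && printf '%s/%s\n' "$(pwd)" "$(basename "$1")" )
}

# Demo on a throwaway file, referred to by a relative path.
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1
: > hyphy-input.fa
abs_alignment=$(to_abs ./hyphy-input.fa)
echo "$abs_alignment"
```

Because only the directory has to exist, the same helper should also work for the output prefix before any output files have been written.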

I admit this looks absolutely horrible. But it’s the best I can do.

In practice

So for me (user=jparker) working from /Downloads/hyphy_hacks/hackinput with alignment hyphy-input.fa and tree hyphy-input.tre, and outputting to files with prefix run2, the complete command is:

/usr/local/bin/HYPHYMP CPU=2 BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ <<<$'/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.fa\n/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.tre\n/home/jparker/Downloads/hyphy_hacks/hackinput/run2'

And if I don’t want to wait for it to complete, and send stdout and stderr to some files, the command is:

/usr/local/bin/HYPHYMP CPU=2 BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ <<<$'/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.fa\n/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.tre\n/home/jparker/Downloads/hyphy_hacks/hackinput/run4' >run4.stdout 2>run4.err &

Lastly you can change the value given to the CPU= option if you want to. Be aware that by default HYPHYMP uses as many cores as it can see (I think)…
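
The whole point of this was to run thousands of loci unattended, so here is a rough bash sketch of the loop. Everything about the layout is an assumption on my part: one sub-directory per locus, each holding hyphy-input.fa and hyphy-input.tre, with run1 as a made-up output prefix. The analysis_cmd array is a stand-in (plain cat) so the sketch runs anywhere; in real use you would swap in your own HYPHYMP command line:

```shell
#!/usr/bin/env bash
# Hypothetical layout: one sub-directory per locus, each with its own inputs.
loci_root=$(mktemp -d)
for locus in locus_A locus_B; do
    mkdir -p "$loci_root/$locus"
    : > "$loci_root/$locus/hyphy-input.fa"
    : > "$loci_root/$locus/hyphy-input.tre"
done

# Stand-in for the real HYPHYMP command line; 'cat' simply echoes the
# stdin it is given, so the sketch runs without HYPHY installed.
analysis_cmd=(cat)

for dir in "$loci_root"/*/; do
    dir=${dir%/}    # strip the trailing slash left by the glob
    # Three answers, newline-separated, in the order the batch file asks for them.
    "${analysis_cmd[@]}" <<<"$dir/hyphy-input.fa"$'\n'"$dir/hyphy-input.tre"$'\n'"$dir/run1" \
        > "$dir/run1.stdout" 2> "$dir/run1.err"
done
```

Each locus gets its own stdout and stderr files, so a failure in one directory doesn't overwrite or hide the results from another.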

Migrating to OS X Mavericks

The time has come, my friends. I am upgrading from 10.6.8 (‘Snow Leopard’) to 10.9 (‘Mavericks’) on my venerable and mistreated MacBook Pros (one is 2010 with a SATA drive, the other 2011 with an SSD). Common opinion holds that the 2010 machine might find it a stretch so I’m starting with the 2011/SSD model. Also, hey, it’s a work machine, so if I truly bork it, Apple Care should (should) cover me…


At least Apple make the upgrade easy enough to get: for the last year or so, Software Update has been practically begging me to install the App Store. Apple offer OSX 10.9 for free through this platform (yes! FREE!!) so it’s a couple of clicks to download and start the installer…


Obviously I’ve backed up everything several times: to Time Machine, on an external HDD; to Dropbox; Drobo; and even the odd USB stick lying around as well as my 2010 MBP and various other machines I have access to. As well as all this, I’ve actually tried to empty the boot disk a bit to make space – unusually RTFM for me – and managed to get it to about 65% available space. I’ve also written down every password and username I have, obviously on bombay mix-flavoured rice-paper so I can eat them after when everything (hopefully) works.


Click the installer. Agree to a few T&Cs (okay, several, but this is Apple we’re talking about). Hit ‘Restart’. Pray…


… And we’re done! That was surprisingly painless. The whole process took less than two hours on my office connection, from download to first login. There was a momentary heart attack when the first reboot appeared to have failed and I had to nudge it along, but so far (couple of days) everything seems to be running along nicely.

Now, I had worried (not unreasonably, given previous updates) that my computer might slow down massively, or blow up altogether. So far this doesn’t seem to have happened. The biggest downsides are the ones I’d previously read about and expected: e.g. PowerPC applications like TreeEdit and Se-Al aren’t supported any more. Apparently the main workaround for this is a 10.6.8 Server install inside Parallels, but I’ll look into this more in a future post when I get a chance.

I was a bit surprised to find that both Homebrew and, even more oddly, my SQL installation needed to be reinstalled, but a host of other binaries didn’t. Presumably there’s a reason for this but I can’t find it. Luckily those two at least install pretty painlessly, but it did make me grateful nothing else broke (yet).

So what are the good sides? The general UI is shiny, not that this matters much in a bioinformatics context, and smart widgets like Notifications are pretty, but to be honest, there aren’t any really compelling reasons to switch. I’ve not used this machine as a laptop much so far, so I can’t comment on the power usage (e.g. stuff like App Nap) yet, although it seems to be improved… a bit.. and I haven’t had time to run any BEAST benchmarks to see how the JVM implementation compares. But there is one massive benefit: this is an OS Apple are still supporting! This matters because stuff like security and firmware updates really do matter, a lot – and release cycles are getting ever shorter, especially as Macs get targeted more. In short: I couldn’t afford to stay behind any longer!

Update [5 Oct 2014]: Given the Shellshock bash exploit affects both 10.6 and 10.9, but Apple aren’t – as yet – releasing a patch for 10.6, while they rushed a 1.0 patch for 10.9 in less than a week, the security aspect of this upgrade is even more clearly important…

Update [23 Oct 2014]: Nope, I won’t be upgrading to Yosemite for a while, either!