Category Archives: Coding

Some aspects of BLASTing long-read data

Quick note to explain some of the differences we’ve observed working with long-read data (MinION, PacBio) for sample ID via BLAST. I’ll publish a proper paper on this, but for now:

  • Long reads aren’t just a bit longer than Illumina data, but two, three, four or possibly even five orders of magnitude longer (up to 10^6 bases already, vs ~10^2). This is mathematically obvious, but extremely important…
  • … the massive length means lots of the yield is in comparatively few reads. This makes yield stats based on numbers of reads positively useless for comparison with NGS. Also…
  • Any given long read contains significantly more information than a short one does. Most obviously the genomics facilities of the world have focused on their potential for improving genome assembly contiguity and repeat spanning (as well as using synteny to spot rearrangements etc) but we’ve also shown (Parker et al, submitted) that whole coding loci can be directly recovered from single reads and used in phylogenomics without assembly and annotation. This makes sense (a ~kb long read can easily span a whole gene, also ~kb in scale) but it certainly wasn’t initially obvious, and given error rates, etc, it’s surprising it actually works.
  • Sample ID using BLAST actually works very differently. In particular, the normal ‘rubbish in, rubbish out’ rule is inverted. In other words, nanopore reads (for the time being) may be long, but inevitably contain errors. However, this length means that assuming BLAST database sequences are approximately as long/contiguous, Nanopore queries tend to either match database targets correctly, with very long alignments (hundreds/thousands of identities), or not at all.

This last point is the most important. It means that, for a read, interpreting the match is simple – you’ll either have a very long alignment to a target, or you won’t. Even when a read has regions of identity to more than one species, it has a much longer cumulative alignment length for the correct one. This is the main result of our paper.
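As a toy sketch of what that ‘cumulative alignment length’ comparison looks like in practice (the read and species names, and all the numbers, are invented; the columns are the first four of BLAST’s tabular -outfmt 6 output):

```shell
# Toy BLAST tabular output: query id, subject id, % identity, alignment length.
# All values invented for illustration.
printf 'read1\tspeciesA\t89.1\t5200\nread1\tspeciesA\t90.4\t3100\nread1\tspeciesB\t95.0\t240\n' > hits.tsv

# Sum alignment length (column 4) per query/subject pair;
# the correct species wins on cumulative length, not on any single hit.
awk -F'\t' '{sum[$1 FS $2] += $4} END {for (k in sum) print k FS sum[k]}' hits.tsv \
  | sort -k3,3nr > totals.tsv
cat totals.tsv
```

Here speciesA totals 8300 aligned positions against 240 for speciesB, so the call is unambiguous even though the speciesB hit has the highest per-hit identity.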

The second implication is that, as it has been put to me, for nanopore reads to be any good for an ID, you have to have a genomic database. While this is true in the narrow sense, our current work (and again, this is partly in our paper, and partly in preparation) shows that in fact all that matters is for the length distribution in the reference database to be similar to the query nanopore one. In particular, we’ve demonstrated that a rapid nanopore sequencing run, with no assembly, can itself serve as a perfectly good reference for future sample ID. This has important implications for sample ID but as I said, more on that later 😉

Step-by-step: Raspberry Pi as a wifi bridge, plus a (really) low-spec media centre…

I’ll keep this brief, really, because this is mainly an aide-memoire for when this horrific bodge breaks in the next, ooh, month or so. But, for context:

The problem:

Our office/studios are in a shed at the bottom of the garden (~15m). Wifi / wireless LAN just reaches, intermittently.

The solution:

Set up an ethernet network in the shed itself, and connect (‘bridge’) that network to the house wifi with a Raspberry Pi.

Kit:

1x Raspberry Pi (Pi 2 Model B; mine overclocked to ~1150MHz) plus SD card and reader; an old ethernet switch and cables; quite a lot of patience.


A bit more detail:

This step-by-step is going to be a bit arse-about-face, in that the order of the steps you’d actually need from scratch is completely different from the max-frustration, highly circuitous route I actually followed. Not least because I already had a Pi with Ubuntu on:

  1. Get a Pi with Ubuntu on it. This will be acting as the wireless bridge to connect the LAN to the wifi; and also serve IP addresses to other hosts on the LAN (network buffs: yes, I realise this is a crap solution). This is the second-easiest step by a mile; see: this guide for MATE and follow it. We’ll set the Pi up to run without a monitor or keyboard (‘headless’ – connecting over SSH) later, but for now don’t ruin your relationship unduly, do this bit the easy way with a monitor attached.
  2. MAKE SURE YOU CHANGE THE DEFAULT USERNAME AND PASSWORD ON THE PI, AND WRITE THEM DOWN. Jeez…
  3. apt-get update (and apt-get upgrade) the Pi a few times. You’ll thank yourself later.
  4. Set the Pi up to act as a wifi <–> LAN bridge. There are a lot of tutorials suggesting various ways to achieve this such as this, this, and all of this noise. But ignore them all – with the newest Ubuntu LTS (16.04 at time of writing) this is now far, far, far easier to do in the GUI, and more stable. Just follow this guide.
  5. Set up some other housekeeping tasks for headless login: enable SSH (see also here); set the clock to attempt to update the system time on boot if a time server’s available (make sure to add e.g. server 0.europe.pool.ntp.org to your /etc/ntp.conf file) and login to the desktop automatically. This last action isn’t necessary, and purists will claim it wastes resources, but this is a Pi 2 and we’re only serving DHCP on it, basically – it can afford that. The reason I’ve enabled this is because it seems to force the WLAN adapter to try to acquire the home wifi a bit more persistently (see below). I’ve tried to achieve the same results using wpa_supplicant, but with no stability, and my time is a pretty finite resource, so screw it – I’m a scientist, not an engineer!
  6. Lastly, I’ve made some fairly heavy-duty edits (not following but at least guided by this and this) to my /etc/network/interfaces file, with a LOT of trial and error which included a couple of false starts bricking my Pi (if that happens to you, reinstall Ubuntu. Sorry.) It now reads (home wifi credentials redacted):
    # interfaces(5) file used by ifup(8) and ifdown(8)
    # Include files from /etc/network/interfaces.d:
    source-directory /etc/network/interfaces.d

    # The loopback network interface
    auto lo
    iface lo inet loopback

    # LOOK at all the crap I tried...
    #allow-hotplug wlan0
    #iface eth0 inet dhcp
    #allow-hotplug wlan0
    #iface wlan0 inet manual
    #iface eth0 inet auto
    #wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf
    # Yep, that lot took a while :\

    # Finally, this worked:
    auto wlan0
    iface wlan0 inet dhcp
    wpa-ssid "q***********"
    wpa-psk "a******"
    # That's it :D
  7. Connect the Pi to your other computers using the switch and miles of dodgy ethernet cabling.
  8. Disconnect the screen, reboot, and wait for a long time – potentially hours – for the Pi to acquire the wifi. You should now be able to a) ping and/or login to the Pi from other hosts on the LAN, and b) ping/access hosts on the home WLAN, and indeed the wider Internet if your WLAN has a connection(!)
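For reference, the NTP tweak from step 5 amounts to a couple of lines in /etc/ntp.conf (the europe pool hostnames shown are the standard public pool; substitute your own region):

```
# /etc/ntp.conf (excerpt) – try these servers to set the clock on boot
server 0.europe.pool.ntp.org
server 1.europe.pool.ntp.org
```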

A Media centre from scratch

Lastly of all, having gone to all that trouble, the glaring bandwidth inadequacies of our crap WLAN showed up. Being stingy by nature (well, and because the phone companies in our area insist that, despite living fewer than a day’s march from Westminster, their exchanges have run out of fibre capacity for 21st-century broadband) I decided to mitigate this for the long winter months the simplest way: gather the zillions of mp3s, ripped DVDs and videos from all our devices onto one server. I put Ubuntu (the same 16.04 / MATE distribution as on the Pi, in fact) onto an old Z77 motherboard my little brother had no use for, in an ancient (~2003) ATX case, with a rock-bottom new Celeron CPU (~£25), plus 4GB of SDRAM and a cheap spinning drive I had lying about (a 2TB Toshiba SATA, IIRC). This is highly bodgy. So much so, in fact, that the CPU fan is cable-tied onto the mobo, because the holes for the anchor pins didn’t line up. But: it works, and it only has to decode/serve MP3s and videos, after all.

I apt-get updated that a few times, added in some extra packages like htop, openssh, and hardinfo – and removed crap like games and office stuff – to make it run about as leanly as possible. Then, to manage and serve media, I installed something I’d wanted to play with for a while: Kodi. This is both a media manager/player (like iTunes, VLC, or Windows Media Player) and also a streaming media server, so other hosts on my LAN can access the library by streaming over the ethernet if they want, without replicating files.

Setting up Kodi was simplicity itself, as was adding movies and music to the library, but one minor glitch I encountered was reading/displaying track info and artwork, which usually happens on iTunes pretty seamlessly via ID3 tags, fingerprinting, and/or Gracenote CDDB querying. Turns out I’d been spoilt this last decade, because in Kodi this doesn’t happen so magically. Instead, you need to use a tool like MusicBrainz Picard to add the tags to MP3s on your system, then re-scan them into Kodi for the metadata to be visible. The re-scanning bit isn’t as onerous as you’d think – files are left in place, the ID3 tags being used simply to update Kodi’s metadata server (I guess) – but the initial Picard search for thousands of MP3s over a slow WLAN took me most of a night.

However. A small price to pay to actually have music to listen to while I work away writing crap like this in the shed, or shoddy-quality old episodes of Blackadder or Futurama to watch in the evening :p

Copying LOADS of files from a folder of LOADS *AND* LOADS more in OSX

Quick one this, as it’s a tricky problem I keep having to Google/SO. So I’m posting here for others but mainly myself too!

Here’s the situation: you have a folder (with, ooh, let’s say 140,000 separate MinION reads, for instance…) which contains a subset of files you want to move or copy somewhere else. Normally, you’d do something simple like a wildcarded ‘cp’ command, e.g.:

host:joeparker$ cp dir/*files_I_want* some/other_dir

Unfortunately, if the list of files matched by that wildcard is sufficiently long (more than a few thousand), you’ll get an error like this:

-bash: /bin/cp: Argument list too long

In other words, you’re going to have to be more clever. Using the GUI/Finder usually isn’t an option either at this point, as the directory size will likely defeat Finder, too. The solution is pretty simple but takes a bit of tweaking to work in OSX (full credit to posts here and here that got me started).
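Incidentally, the limit being hit here is the kernel’s ARG_MAX – the maximum bytes of arguments (plus environment) a new process can receive – which you can inspect directly:

```shell
# The shell expands the wildcard BEFORE cp runs, so the resulting argument
# list can exceed the kernel's per-process limit (the E2BIG error). Check it:
getconf ARG_MAX
```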

Basically, we’re going to use the ‘find’ command to locate the files we want, then pass each one in turn as an argument to ‘cp’ using ‘find -exec’. This is a bit slower overall than doing the equivalent as our original wildcarded command but since that won’t work we’ll have to lump it! The command* is:

find dir -maxdepth 1 -name '*files_I_want*' -exec cp {} some/other_dir \;

Simple, eh? Enjoy :)

*NB, In this command:

  • dir is the filesystem path to start the search; ‘find’ will recursively traverse the directory tree including and below this folder;
  • -name '*glob*' gives the files to match (quote the glob, or the shell will expand it before find runs – the very problem we’re trying to avoid);
  • -maxdepth is how deep to recurse (e.g. 1 = ‘don’t recurse’);
  • cp is the command we’re executing on each found file (could be mv etc);
  • {} is the placeholder that -exec substitutes with each found file;
  • some/other_dir is the destination argument to the command invoked by -exec
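As an aside, a piped xargs variant can be a bit quicker than -exec, because it batches many files into each cp call rather than forking one cp per file. A minimal sketch on throwaway files (the GNU cp -t flag shown is the Linux form; on OSX/BSD you’d use xargs -0 -J {} cp {} some/other_dir instead):

```shell
# Demo on throwaway files; 'files_I_want' stands in for the real glob.
mkdir -p dir some/other_dir
touch dir/a_files_I_want_1 dir/a_files_I_want_2 dir/other.txt

# -print0 / -0 keep awkward filenames safe; GNU cp -t names the destination
# up front so xargs can append many source files to a single cp call.
find dir -maxdepth 1 -name '*files_I_want*' -print0 \
  | xargs -0 cp -t some/other_dir
ls some/other_dir
```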

How to fake an OSX theme appearance in Linux Ubuntu MATE

I’ve recently been fiddling about, trying to fake an OSX-style GUI appearance in Linux Ubuntu MATE (15.04). This is partly because I prefer the OSX GUI (let’s be honest) and partly because most of my colleagues are Mac users (bioinformaticians…) and students in particular fear change! The Mac-style appearance seems to calm people down. A bit.

The specific OS I’m going for is 10.9 Mavericks, because it’s my current favourite and nice and clear. There are two main things to set up: the OS itself and the appearance. Let’s take them in turn.

1. The OS

I’ve picked Ubuntu (why on Earth wouldn’t you?!) and specifically the MATE distribution. This has a lot of nice features that make it fairly Mac-y, and in particular the windowing and package management seem smoother to me than the vanilla Ubuntu Unity system. Get it here: https://ubuntu-mate.org/vivid/.* The installation is pretty painless on Mac, PC or an existing Linux system. If in doubt you can use a USB as the boot volume without affecting existing files; with a large enough partition (the core OS is about 1GB) you can save settings – including the customisation we’re about to apply!

*We’re installing the 15.04 version, not the newest release, as 15.04 is an LTS (long-term support) Ubuntu distribution. This means it’s supported officially for a good few years yet. [Edit: Arkash (see below) kindly pointed out that 14.04 is the most recent LTS, not 15.04. My only real reason for using 15.04 therefore is ‘I quite like it and most of the bugs have gone’(!)]

2. The appearance

The MATE windowing system is very slick, but the green-ness is a bit, well, icky. We’re going to download a few appearance mods (themes, in Ubuntu parlance) which will improve things a bit. You’ll need to download these to your boot/install USB: the ‘Ultra-Flat Yosemite Light’ theme, the ‘OSX-MATE’ theme (as a git repository download), and the Solarized MATE terminal archive (solarized-mate-terminal).

Boot the OS

Now that we’ve got everything we need, let’s boot up the OS. Insert the USB stick into your Mac-envious victim of choice, power it up and enter the BIOS menu (F12 in most cases) before the existing OS loads. Select the USB drive as the boot volume and continue.

Once the Ubuntu MATE session loads, you’ll have the option of trialling the OS from the live USB, or permanently installing it to a hard drive. For this computer I won’t be installing to a hard drive (long story) but using the USB, so customising that. Pick either option, but beware that customisations to the live USB OS will be lost should you later choose to install to a hard drive.

When you’re logged in, it’s time to smarten this baby up! First we’ll play with the dock a bit. From the top menu bar, select “System > Preferences > MATE Tweak” to open the windowing management tool. In the ‘Interface’ menu, change Panel Layouts to ‘Eleven’ and Icon Size to ‘Small’. In the ‘Windows’ menu, we’ll change Buttons Layout to ‘Contemporary (Left)’. Close the MATE Tweak window to save. This is already looking more Mac-y, with a dock area at the bottom of the screen, although the colours and icons are off.

Now we’ll apply some theme magic to fix that. Select “System > Preferences > Look and Feel > Appearance”. Now we can customise the appearance. Firstly, we’ll load both the ‘Ultra-Flat Yosemite Light’ and ‘OSX-MATE’ themes, so they’re available to our hybrid theme. Click the ‘Install…’ icon at the bottom of the theme selector and you’ll be able to select and install the Ultra-Flat Yosemite Light theme we downloaded above. It should unpack from the .zip archive and appear in the themes panel. Installing the OSX-MATE theme is slightly trickier:

  • Unzip (as sudo) the OSX-MATE theme to /usr/share/themes
  • Rename it from OSX-MATE-master to OSX-MATE if you downloaded it from git as a whole repository (again, you’ll need to sudo)
  • Restart the appearances panel and it should now appear in the themes panel.

We’ll create a new custom theme with the best bits from both themes, so click ‘Custom’ theme, then ‘Customise..’ to make a new one. Before you go any further, save it under a new name! Now we’ll apply changes to this theme. There are five customisations we can apply: Controls, Colours, Window Border, Icons and Pointer:

  • Controls: Ultra-Flat Yosemite Light
  • Colours: There are eight colours to set here. Click each colour box then in the ‘Colour name’ field, enter:
    • Windows (foreground): #F0EAE7 / (background): #0F0F0E
    • Input boxes (fg): #FFFFFF / (bg): #0F0F0E
    • Selected items (fg): #003BFF / (bg): #F9F9F9
    • Tooltips: (fg): #2D2D2D / (bg): #DEDEDE
  • Window border: OSX-MATE
  • Icons: Fog
  • Pointer: DMZ (Black)

Save the theme again, and we’re done! Exit Appearance Preferences.

Finally we’ll install Solarized as the default terminal (command-line interface) theme, because I like it. In the MATE Terminal, unzip the solarized-mate-terminal archive (as sudo). Enter the directory and simply run the install script (as sudo) using bash:


$ sudo unzip solarized-mate-terminal
$ cd solarized-mate-terminal
$ sudo bash solarized-mate.sh

Close and restart the terminal. Hey presto! You should now be able to see the light/dark Solarized themes available, under ‘Edit > Profiles’. You’ll want to set one as the default when opening a new terminal.

Finally…

Later, I also installed Topmenu, a launchpad applet that gives an OSX-style top-anchored application menu to some linux programs. It’s a bit cranky and fiddly though, so you might want to give it a miss. But if you have time on your hands and really need that Cupertino flash, be my guest. I hope you’ve had a relatively easy install for the rest of this post, and if you’ve got any improvements, please let me know!

Happy Tweaking…

BaTS (and Befi-BaTS), SHiAT, and Genome Convergence Pipeline have moved!

Important – please take note!
Headline:

  • All my phylogenetics software is now on GitHub, not websites or Google Code
  • Please use the new FAQ pages and issue/bug tracker forms, rather than emailing me directly in the first instance

Until now, I’ve been hosting the open-sourced parts of my phylogenetics software on code.google.com. These include the BaTS (and Befi-BaTS) tools for phylogeny-trait association correlations; the alignment profilers SHiAT (and Geneious Entropy plugin), and the Genome Convergence API for the Genome Convergence Pipeline and Phylogenomics Dataset Browser. However, Google announced that they are ending support for Google Code, and from August all projects will be read-only.

I’ve therefore migrated all my projects to Github. This will eventually include FAQs, forums and issue/bug tracking for the most popular software, BaTS and Genome Convergence API.

The projects can now be found at:

 

I am also changing how I respond to questions and bug requests. In the past I dealt with questions as they came in, with the odd explanatory post and a manual or readme with each release. Predictably, this meant I spent a lot of time dealing with duplicates or missing bugs or feature requests. I am now in the process of compiling a list of FAQs for each project, as well as uploading the manuals in markdown format so that I can update them with each release. Please bear with me as I go through this process. In the meantime, if you have an issue with a piece of software or think you have found a bug, please:

  1. Make sure you have the most recent version of the software. In most cases this will be available as an executable .jar file on the project github page.
  2. Check the ‘Issues’ tab on the project github page. Your issue may be a duplicate, or already fixed by a new release. If your bug isn’t listed, please open a new issue giving as much detail as possible.
  3. Check the manual and FAQs to see if anyone else has had the same problem – I may well have answered their question already.
  4. If you still need an answer please email me on joe+bioinformaticshelp@kitserve.org.uk

Thanks so much for your support and involvement,

Joe

Embedding Artist profiles, playlists, and content from Spotify in HTML

Quick post this – turns out Spotify have added a really cool new function to their desktop application: You can now right-click any resource in Spotify (could be an artist, a playlist, a profile or a track or album) and get a link to the HTML code you need to embed it into another webpage. The link looks like this:

[screenshot of the right-click menu]

The HTML is then copied to your clipboard, ready to drop into an artist webpage. Pretty cool eh? Let’s give it a spin:

<iframe src="https://embed.spotify.com/?uri=spotify%3Aartist%3A4qsWY8X6Yq3TTVe4gn6cnL" height="300" width="300" frameborder="0"></iframe>


Parsing numbers from multiple formats in Java

We were having a chat over coffee today and a question arose about merging data from multiple databases. At first sight this seems pretty easy, especially if you’re working with relational databases that have unique IDs (like, uh, a Latin binomial name – Homo sapiens) to hang from… right?

But, oh no.. not at all. One important reason is that seemingly similar data fields can be extremely tricky to merge. They may have been stated with differing precision (0.01, 0.0101, or 0.01010199999?), be encoded in different data types (text, float, blob, hex etc) or character set encodings (UTF-8 or Korean?) and even after all that, refer to subtly different quantities (mass vs weight perhaps). Who knew database ninjas actually earnt all that pay.

So it was surprising, but understandable, to learn that a major private big-data user (unnamed here) stores pretty much everything as text strings. Of course this solves one set of problems nicely (everyone knows how to parse/handle text, surely?) but creates another. That’s because it is trivially easy to code the same real-valued number in multiple different text strings – some of which may break sort algorithms, or even memory constraints. Consider the number ‘0.01’: as written there’s very little ambiguity for you and me. But what about:

“0.01”,
“00.01”,
“ 0.01” (note the space),
or even “0.01000000000”?
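The ambiguity is easy to demonstrate at the shell, before even reaching Java: compared as text, the four strings all differ, but parsed as numbers they are all equal (here using awk’s string-to-number coercion as a stand-in for parseFloat):

```shell
# As text, only the canonical form matches; parsed as numbers, all four
# strings are equal to 0.01.
for s in '0.01' '00.01' ' 0.01' '0.01000000000'; do
  [ "$s" = '0.01' ] && echo "text-equal: [$s]"
  awk -v v="$s" 'BEGIN { if (v + 0 == 0.01) print "numeric-equal: [" v "]" }'
done > parse_check.txt
cat parse_check.txt
```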

After a quick straw poll, we also realised that although we use string-to-float conversion all the time, none of us could say exactly how our most-used programming languages (Java for me; Perl, Python etc for others) actually perform it in their native string-to-float methods. We knew how we thought they worked, and how we hoped they would, but it’s always worth checking. Time to write some quick code – here it is, on GitHub

And in code:

package uk.ac.qmul.sbcs.evolution.sandbox;

/**
 * Class to test the Float.parseFloat() method performance on text data,
 * in particular odd strings which should be equal, e.g.:
 * <ul>
 *     <li>"0.01"</li>
 *     <li>"00.01"</li>
 *     <li>" 0.01" (note space)</li>
 *     <li>"0.0100"</li>
 * </ul>
 *
 * NB uses assertions to test - run the JVM with the '-ea' argument.
 * The first three tests should pass in the orthodox manner.
 * The fourth should throw assertion errors to pass.
 *
 * @author joeparker
 */
public class TextToFloatParsingTest {

    /**
     * Default no-arg constructor
     */
    public TextToFloatParsingTest(){
        /* Set up the floats as strings */
        String[] floatsToConvert = {"0.01","00.01"," 0.01","0.0100"};
        Float[] floatObjects = new Float[4];
        float[] floatPrimitives = new float[4];

        /* Convert the floats, first to Float objects, then cast to float primitives */
        for(int i=0;i<4;i++){
            floatObjects[i] = Float.parseFloat(floatsToConvert[i]);
            floatPrimitives[i] = floatObjects[i];
        }

        /* Are they all equal? They should be: test this. Should PASS */
        /* Iterate through the triangle */
        System.out.println("Testing conversions: test 1/4 (should pass)...");
        for(int i=0;i<4;i++){
            for(int j=1;j<4;j++){
                assert(floatPrimitives[i] == floatPrimitives[j]);
                assert(floatObjects[i] == floatPrimitives[j]);
            }
        }
        System.out.println("Test 1/4 passed OK");

        /* Test equality with the numerical equivalent. Should PASS */
        System.out.println("Testing conversions: test 2/4 (should pass)...");
        for(int i=0;i<4;i++){
            assert(floatPrimitives[i] == 0.01f);
        }
        System.out.println("Test 2/4 passed OK");

        /* Test inequality with a different numerical value. Should PASS */
        System.out.println("Testing conversions: test 3/4 (should pass)...");
        for(int i=0;i<4;i++){
            assert(floatPrimitives[i] != 0.02f);
        }
        System.out.println("Test 3/4 passed OK");

        /* Test the inversion */
        /* These assertions should FAIL */
        System.out.println("Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...");
        boolean test_4_pass_flag = false;
        try{
            for(int i=0;i<4;i++){
                for(int j=1;j<4;j++){
                    assert(floatPrimitives[i] != floatPrimitives[j]);
                    assert(floatObjects[i] != floatPrimitives[j]);
                    test_4_pass_flag = true; // If AssertionErrors are thrown as we expect, this is never reached.
                }
            }
        }finally{
            // test_4_pass_flag should never be set true if AssertionErrors have been thrown correctly.
            if(test_4_pass_flag){
                System.err.println("Test 4/4 assertions passed! This constitutes a logical FAILURE");
            }else{
                System.out.println("Test 4/4 passed OK (expected assertion errors occurred as planned).");
            }
        }
    }

    public static void main(String[] args) {
        new TextToFloatParsingTest();
    }
}


If you run this with assertions enabled (‘/usr/bin/java -ea uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest’) you should get something like:

Testing conversions: test 1/4 (should pass)...
Test 1/4 passed OK
Testing conversions: test 2/4 (should pass)...
Test 2/4 passed OK
Testing conversions: test 3/4 (should pass)...
Test 3/4 passed OK
Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...
Exception in thread "main" java.lang.AssertionError
    at uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest.<init>(TextToFloatParsingTest.java:60)
    at uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest.main(TextToFloatParsingTest.java:76)
Test 4/4 passed OK (expected assertion errors occurred as planned).

HYPHY Hack: Passing arguments to HYPHY for phylogenetics using the command-line

[ Important update, 2017-Feb-07 ]

This solution, already a bit hacky, should now be considered a last resort. Sergei and colleague Stephen Weaver have suggested a much more elegant solution; see: https://github.com/veg/hyphy/issues/522. You’ll still have to dive into the batch file you want to iterate over (to work out which user options are presented, and in which order) but you should not have to edit the batch files themselves directly. The solution below may no longer work for some versions of HyPhy, owing to altered fscanf() behaviour.

HYPHY is a great platform for advanced phylogenetics by Sergei L. Kosakovsky Pond, Simon D. W. Frost and Spencer V. Muse, in which abstract concepts such as likelihood-ratio tests, model selection, and phylogenetic inference are represented and manipulated by means of a powerful and flexible object-oriented language called HyPhy Batch Language, or HBL, using workflows known as ‘batch files’ (actually more like routines). A large number of publications to date (around a thousand) have made use of HYPHY, which includes additional features such as a GUI and ready-to-use implementations of advanced published methods. It also powers the datamonkey.org online phylogenetics server.

However, for all this flexibility, HYPHY actually has an ugly side: because the batch file system is so central to operations, there isn’t a convenient way to pass arguments to HYPHY via the command-line. Yes, there are plenty of ways to get data into HYPHY at or before runtime (hard-coded options; reading in config files; dialog prompts on the command-line or GUI), but none that corresponds to a standard POSIX-style program argument. In a phylogenomics context this caused our group some problems…

The problem

Let’s suppose we have a set of loci (perhaps a few thousand), with different names. An earlier pipeline has produced a set of subdirectories, one for each locus, with an alignment file and a phylogenetic tree in each. Say we want to run the same positive selection test (I’ll assume the branch-site random-effects likelihood test for this post, implemented already in HYPHY as the BranchSiteREL.bf batch file) on each in HYPHY – how can we do that? We have a few options:

  1. Run HYPHY in GUI mode: This has the advantage of being easy to do. But it’s incredibly demanding of human input – who’s going to sit and click through thousands of HYPHY sessions? This input will also make it slower (depending on the analysis, the human component might be the limiting step); and it will certainly introduce the potential for human errors.
  2. Create a custom HYPHY batch file, and rename the input files in each locus: In other words, a script which looks for input files named something like ‘input.fasta‘ and ‘input.tre‘, and runs the analysis on them. Unfortunately, there’s a risk we might over-write files we don’t want to if one or more HYPHY calls fail part-way through. It could also be hard to parallelise.
  3. Create a custom HYPHY batch file to loop through the input directories: This is how we probably ought to do things natively in the ‘HYPHY way’ – HBL is powerful enough to let us do things like read directory contents, split and test and generally manipulate strings etc. So we could probably work out how to write a wrapper batch file in HBL for HYPHY that would call BranchSiteREL.bf . But do we really want to delve deeply into yet another language just to do that? And suppose we wanted to run the same analysis on another set of data in a month or so – we’d have to edit the wrapper file to loop through a different directory…
  4. What we really want to do is pass arguments to HYPHY using the command-line: That is, we want to be able to use the STDIN standard input stream to pass the input alignment and phylogeny files’ paths to HYPHY, read them into BranchSiteREL.bf  as variables, and execute the batch file with no further input. This method will be flexible – we can use any paths we want, and change them at any time – and modular because we won’t have lots of different BranchSiteREL.bf files sitting about for analyses at different times, just one.

It turns out that it’s actually pretty easy to do this – it took me an hour or so to work it out, and a couple more for implementation and testing – and with this guide you should be able to do it far quicker. There are several steps:

  1. Refactor the existing batch file to collect key variables
  2. Edit batch file to read variables from STDIN
  3. Call HYPHY in command-line mode, passing variables in-place as a ‘here’ string

That’s it! Here are the steps in detail:

1. Refactor the existing batch file to collect key variables

(NB: links to my hacked copies further down this page)

If you’re not familiar with HYPHY (and if you were, you probably wouldn’t be interested in this hack), this will be the intimidating bit. But relax: if you know C, Java, Perl, or any modernish procedural language, this is easy.

What we want to do is take the existing standard analysis batch file which came with HYPHY, BranchSiteREL.bf, and work out all the places where HYPHY expects user input. We’ll need to either hardcode those, or pass variables from the command-line. To make this less likely to break, we’re going to a) work on a copy of the batch file (mine’s called BranchSiteREL_joeHack.bf), and b) refactor the code so all those variables are initialised right at the start of the batch file, where we can see them.

To start with, run the batch file in GUI mode as normal. This lets you check the input files are actually formatted correctly. Also note down all the points where the script asks for input, and what you want those inputs to be. In the REL test, the steps are: pick genetic code (‘universal’); input alignment (‘hyphy-input.fasta’); input phylogeny (‘hyphy-input.tre’); and output file (‘hyphy-output.REL’ but really, output file prefix – there’s several outputs in fact, which will share this prefix). Now we can go to the head of the copied BranchSiteREL_joeHack.bf file, and set these variables up. To start with, we’ll hardcode them. Later, we’ll read them from the command line via standard input. I’ve used ALL_CAPS variables for readability, not that HBL cares:

/* Variables we'll define and later set by STDIN */
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_GENETIC_CODE = 1;
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";

/* Start of normal batch file */
skipCodeSelectionStep = 0;
LoadFunctionLibrary("chooseGeneticCode_HardcodeUniversal");

LoadFunctionLibrary("GrabBag");
LoadFunctionLibrary("dSdNTreeTools");
LoadFunctionLibrary("CF3x4");
LoadFunctionLibrary("BranchSiteTemplate");
...

So the four variables we’ve introduced are: JOE_HARDCODE_ALIGNMENT; JOE_HARDCODE_PHYLOGENY; JOE_HARDCODE_GENETIC_CODE; and JOE_HARDCODE_OUTPUT. We’ve defined these, but they’re not actually used anywhere yet – as things stand, HYPHY will still try and ask the user for input. What we need to do instead is go through the batch file looking for methods that prompt the user for input, and replace them with our variables instead. From a quick read of the HBL documentation (nb, the HTML documentation that comes with HYPHY is more useful), there seem to be two main ways HYPHY gets user input. They are:

/* fscanf() - reads input to a variable, e.g from console (command-line) to a string, as here: */
fscanf(stdin,"String",SOME_VARIABLE);
/* PROMPT_FOR_FILE, a special variable that opens a system dialog/file chooser, as here: */
DataSet ds = ReadDataFile(PROMPT_FOR_FILE);

All we need to do is look through the batch files for the places where the user interactions we noted in our GUI session happen, and replace the fscanf()‘s or PROMPT_FOR_FILE‘s with our variables. Then, when we change the variables from being hardcoded to being passed as arguments at the command-prompt, we’ll have our complete program. In the case of BranchSiteREL.bf, there are in fact a number of included scripts (additional batch files or model definition files) used in the analysis – so in some cases we need to change those too. Make sure to use copies and rename them…
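Rather than eyeballing the whole batch file, a quick grep can list candidate input points for you. A minimal sketch – the demo_TemplateBatchFiles directory and sample.bf file below are stand-ins I've invented for the real TemplateBatchFiles folder:

```shell
#!/bin/bash
# Hypothetical quick audit: list every line in the template batch files
# that reads user input, so we know exactly what to replace. The demo
# directory here stands in for the real TemplateBatchFiles.
mkdir -p demo_TemplateBatchFiles
cat > demo_TemplateBatchFiles/sample.bf <<'EOF'
DataSet ds = ReadDataFile(PROMPT_FOR_FILE);
fscanf(stdin,"String",SOME_VARIABLE);
x = 1;
EOF
# -r recurse, -n show line numbers, -e one pattern per input mechanism
matches=$(grep -rn -e 'PROMPT_FOR_FILE' -e 'fscanf(stdin' demo_TemplateBatchFiles/)
echo "$matches"
```

Run against the real TemplateBatchFiles directory, this gives you a checklist of every interaction point, with file names and line numbers.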

The datafile (alignment)
This is found in BranchSiteREL.bf:11, as above. This line is easy to find and change:

DataSet ds = ReadDataFile(PROMPT_FOR_FILE);
/* Change PROMPT_FOR_FILE
to our initialised JOE_HARDCODE_ALIGNMENT

Make sure to _replace_ 'PROMPT_FOR_FILE'
or comment out the old line if you want to copy it! */

DataSet ds = ReadDataFile(JOE_HARDCODE_ALIGNMENT);

The output files’ prefix
This is found in BranchSiteREL.bf:47, as above. Also easy, although PROMPT_FOR_FILE is used in an odd context:

SetDialogPrompt ("Save analysis results to");
fprintf (PROMPT_FOR_FILE, CLEAR_FILE, KEEP_OPEN,"Branch,Mean_dNdS,Omega1,P1,Omega2,P2,Omega3,P3,LRT,p,p_Holm")
/* Replace PROMPT_FOR_FILE with JOE_HARDCODE_OUTPUT */
fprintf (JOE_HARDCODE_OUTPUT, CLEAR_FILE, KEEP_OPEN,"Branch,Mean_dNdS,Omega1,P1,Omega2,P2,Omega3,P3,LRT,p,p_Holm");

The tree (phylogeny)
Annoyingly, this is found in a required batch file, not the main one. It’s found in queryTree.bf, so we need to locate this file, rename it, edit it, and also edit the place where it is called so that our hacked version is called instead. queryTree.bf itself is found in the same directory (TemplateBatchFiles) as BranchSiteREL.bf. I copied it to queryTree_hardcode.bf. Within this the relevant line is queryTree.bf:59, with a similar syntax to the output file:

if (!IS_TREE_PRESENT_IN_DATA)
{
SetDialogPrompt ("Please select a tree file for the data:");

fscanf (PROMPT_FOR_FILE, REWIND, "Raw", treeString);
/* As before, replace PROMPT_FOR_FILE
with our phylogeny variable. In my case,
JOE_HARDCODE_PHYLOGENY*/

fscanf (JOE_HARDCODE_PHYLOGENY, REWIND, "Raw", treeString);

Because this is an external function library, we need to find where in BranchSiteREL.bf it’s imported, and make sure our hacked copy is loaded instead. We need to edit BranchSiteREL.bf:44:

LoadFunctionLibrary ("queryTree");
/* Replace with our queryTree_hardcode.bf
(the *.bf suffix isn't needed) */

LoadFunctionLibrary ("queryTree_hardcode");

The genetic code translation definitions
The genetic code translation type is also handled in an external library, chooseGeneticCode.def, but annoyingly, this isn’t in TemplateBatchFiles, but a TemplateBatchFiles/TemplateModels subdirectory. Such is life… again, I’ve worked on a copy, chooseGeneticCode_HardcodeUniversal.def, and after modifying the library itself we need to edit the library call to make sure our hacked version is pulled in. First, the edit, which uses a slightly different, but still intuitive syntax, found at chooseGeneticCode.def:95:

if (!skipCodeSelectionStep)
{
/* this is where the user input routine ChoiceList() is called... */
ChoiceList (modelType,"Choose Genetic Code",1,SKIP_NONE,_geneticCodeOptionMatrix);

if (modelType < 0)
{
return;
}
/* but this is where the variable is actually set... */
ApplyGeneticCodeTable (modelType);
/* ... so we'll replace modelType with our global JOE_HARDCODE_GENETIC_CODE variable */
ApplyGeneticCodeTable (JOE_HARDCODE_GENETIC_CODE);
}

The corresponding call to TemplateModels.chooseGeneticCode.def in BranchSiteREL.bf is right back at line 2:

skipCodeSelectionStep = 0;
LoadFunctionLibrary("chooseGeneticCode");
/* Replace the default library with our hacked one -
Note that the subdirectory path isn't needed; the TemplateModels subdirectory is searched by default */

LoadFunctionLibrary("chooseGeneticCode_HardcodeUniversal");

 

2. Edit batch file to read variables from STDIN

Phew! Good news: that was the fiddly bit; the rest of this is all easy. The next step is to replace the hardcoded variable initialisations at the head of our BranchSiteREL.bf copy with fscanf() calls that will assign values to these variables from the standard input (command-line). So we’ll comment out:

/* Variables we'll define and later set by STDIN */
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_GENETIC_CODE = 1;
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";
/* Start of normal batch file */
skipCodeSelectionStep = 0;
...

And replace them with:

/* Variables we'll define and later set by STDIN */
/* comment out the hardcoded definitions ...
JOE_HARDCODE_ALIGNMENT = "hyphy-input.fa";
JOE_HARDCODE_PHYLOGENY = "hyphy-input.tre";
JOE_HARDCODE_GENETIC_CODE = 1;
JOE_HARDCODE_OUTPUT = "hyphy-output.REL";

And replace with stdin read via fscanf(): */

fscanf(stdin,"String",JOE_HARDCODE_ALIGNMENT);
fscanf(stdin,"String",JOE_HARDCODE_PHYLOGENY);
fscanf(stdin,"String",JOE_HARDCODE_OUTPUT);
JOE_HARDCODE_GENETIC_CODE = 1; // OK, we'll keep this one hardcoded for now
/* Start of normal batch file */
skipCodeSelectionStep = 0;
...

These are pretty self-explanatory. Done!

3. Call HYPHY in command-line mode, passing variables in-place as a ‘here’ string

At this point, all we’ve really done is refactor the batch file. We’ve moved where the variables are initialised / set, so that we can find them easily, and we’ve called fscanf() on each of them in order to set them. So far, because this implies someone, somehow, will need to type stuff into stdin at a prompt, it doesn’t actually solve our main problem – how to pass variables on the command line to HYPHY – but what it has done is made everything a lot neater. Note that these are still three separate calls to fscanf(), however – which means HYPHY will expect three discrete chunks of user interaction. In a nutshell, if we ran HYPHY now, we’d get something like:

>HYPHY: Please choose a data file:
me: /foo/bar/hyphy_input.fa

>HYPHY: Please select a tree:
me: /foo/bar/hyphy_input.tre

>HYPHY: Please choose a file for output:
me: /foo/bar/hyphy_output.REL

So we need to get bash to accept input from a file or command-line, and pass it on to HYPHY each time HYPHY wants input. The easy way to do this is to put each user response on a separate line in a shell.sh file, and use the ‘<‘ operator to redirect the standard input stream to this file instead of the keyboard. This might look a bit like:

# in: commands.sh - one response per line, in the order HYPHY asks:
# alignment, then tree, then output prefix. (Don't put comments on the
# response lines themselves - the whole line gets read as input!)
hyphy-input.fasta
hyphy-input.tre
hyphy-output.REL

# HYPHYMP (the binary) could then be called with:
$user~: HYPHYMP BranchSiteREL_joeHack.bf < commands.sh

But that wouldn’t really help us, would it? We’d have to edit commands.sh separately for each locus! Luckily there is a handy Bash trick which I had to search for a bit – the ‘here’ string (I found this on LinuxJournal). This lets us redirect a string in-place to a command’s standard input, and takes the form:

$user~: command <<<'input_string_to_stdin'

Remembering that we had three fscanf() calls, one for each of our refactored variables, we’ll need three inputs. No problem (StackExchange to the rescue) – we can separate the inputs with newline (‘\n’) characters (we’ll also need the ‘$’ operator, to make sure bash interprets the newlines correctly), like this:

$user~: command <<<$'input_1\ninput_2\ninput_3'

This syntax is equivalent to giving command three separate, sequential inputs, one per line.
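You can verify this equivalence in bash itself; here’s a minimal sketch using the shell’s read builtin to stand in for HYPHY’s three fscanf() calls (the variable names are just for illustration):

```shell
#!/bin/bash
# Three sequential reads from one here-string: each read consumes one
# newline-separated field, just as each fscanf(stdin,...) call in the
# batch file consumes one chunk of input.
{ read alignment; read tree; read output; } <<<$'in.fa\nin.tre\nout.REL'
echo "alignment=$alignment tree=$tree output=$output"
# prints: alignment=in.fa tree=in.tre output=out.REL
```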

Putting it all together

Finally we’ve got everything we need to run HYPHY in command-line mode. To recap:

  • A command-line friendly version of HYPHY (see this post);
  • The edited versions of BranchSiteREL.bf, chooseGeneticCode.def and queryTree.bf, renamed and in place alongside their original copies;
  • Input alignment and tree files, and a writeable output directory;
  • A means (the ‘here’ or ‘<<<‘ operator) of sending multiple-line inputs to the standard input stream.

Running HYPHY on the command line with arguments passed

Let’s do this! There are a couple of additional options (CPU=integer, which sets the number of cores, and BASEPATH=/path/to/batchfiles, which ensures the right batchfile directory is being used) but don’t worry about those for now.

The complete command is:

/usr/local/bin/HYPHYMP CPU=number_of_cpu_cores BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ BranchSiteREL_joeHack.bf <<<$'/path/to/hyphy_input.fa\n/path/to/hyphy_input.tre\n/path/to/hyphy_output.REL'

You can optionally use stuff like >log.out and 2>log.err to redirect STDOUT and STDERR if you want; also & to fork and leave it running, etc. But the critical bit of this command is the last bit, after the ‘<<<‘ operator. I’ve only tested this using absolute/full pathnames for the input/output file arguments – it’s a pain, but less likely to break in the short-term (what happens if you move the whole project folder is another matter…)

I admit this looks absolutely horrible. But it’s the best I can do.
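One way to tame it a bit is to wrap the call in a small shell function that assembles the here-string from three arguments. This is just a sketch: the run_branchsiterel name and the HYPHY_BIN variable are my inventions (the latter purely so the binary location can be overridden), and the paths are the ones assumed above:

```shell
#!/bin/bash
# Hypothetical wrapper: run_branchsiterel ALIGNMENT TREE OUTPUT_PREFIX
# builds the three-line stdin payload and calls HYPHYMP. The binary and
# BASEPATH locations are assumptions from this post - adjust to taste.
HYPHY_BIN=${HYPHY_BIN:-/usr/local/bin/HYPHYMP}

run_branchsiterel () {
    local aln=$1 tree=$2 out=$3
    "$HYPHY_BIN" CPU=2 \
        BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ \
        BranchSiteREL_joeHack.bf <<<"$aln"$'\n'"$tree"$'\n'"$out"
}

# usage:
# run_branchsiterel /path/to/in.fa /path/to/in.tre /path/to/run1
```

Drop it in your .bashrc (or a sourced helpers file) and each locus is then one short command rather than one horrible incantation.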

In practice

So for me (user=jparker) working from /Downloads/hyphy_hacks/hackinput with alignments hyphy-input.fa and hyphy-input.tre, and outputting to files with prefix run2, the complete command is:

/usr/local/bin/HYPHYMP CPU=2 BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ BranchSiteREL_joeHack.bf <<<$'/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.fa\n/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.tre\n/home/jparker/Downloads/hyphy_hacks/hackinput/run2'

And if I don’t want to wait for it to complete, and send stdout and stderr to some files, the command is:

/usr/local/bin/HYPHYMP CPU=2 BASEPATH=/usr/local/lib/hyphy/TemplateBatchFiles/ BranchSiteREL_joeHack.bf <<<$'/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.fa\n/home/jparker/Downloads/hyphy_hacks/hackinput/hyphy-input.tre\n/home/jparker/Downloads/hyphy_hacks/hackinput/run4' >run4.stdout 2>run4.err &

Lastly you can change the argument to the CPU= command if you want to. Be aware that by default HYPHYMP uses as many cores as it can see (I think)…

Migrating to OS X Mavericks

The time has come, my friends. I am upgrading from 10.6.8 (‘Snow Leopard’) to 10.9 (‘Mavericks’) on my venerable and mistreated MacBook Pros (one is 2010 with a SATA drive, the other 2011 with an SSD). Common opinion holds that the 2010 machine might find it a stretch so I’m starting with the 2010/SSD model first. Also, hey, it’s a work machine, so if I truly bork it, Apple Care should (should) cover me…

Availability

At least Apple make the upgrade easy enough to get: for the last year or so, Software Update has been practically begging me to install the App Store. Apple offer OSX 10.9 for free through this platform (yes! FREE!!) so it’s a couple of clicks to download and start the installer…

Preamble

Obviously I’ve backed up everything several times: to Time Machine, on an external HDD; to Dropbox; Drobo; and even the odd USB stick lying around as well as my 2010 MBP and various other machines I have access to. As well as all this, I’ve actually tried to empty the boot disk a bit to make space – unusually RTFM for me – and managed to get the usage down to about 65% available space. I’ve also written down every password and username I have, obviously on bombay mix-flavoured rice-paper so I can eat them after when everything (hopefully) works.

Installation

Click the installer. Agree to a few T&Cs (okay, several, but this is Apple we’re talking about). Hit ‘Restart’. Pray…

Results

… And we’re done! That was surprisingly painless. The whole process took less than two hours on my office connection, from download to first login. There was a momentary heart attack when the first reboot appeared to have failed and I had to nudge it along, but so far (couple of days) everything seems to be running along nicely.

Now, I had worried (not unreasonably, given previous updates) that my computer might slow down massively, or blow up altogether. So far this doesn’t seem to have happened. The biggest downsides are the ones I’d previously read about and expected: e.g. PowerPC applications like TreeEdit and Se-Al aren’t supported any more. Apparently the main workaround for this is a 10.6.8 Server install inside Parallels, but I’ll look into this more in a future post when I get a chance.

I was a bit surprised to find that both Homebrew and, even more oddly, my SQL installation needed to be reinstalled, but a host of other binaries didn’t. Presumably there’s a reason for this, but I can’t find it. Luckily those two at least install pretty painlessly, but it did make me grateful nothing else broke (yet).

So what are the good sides? The general UI is shiny, not that this matters much in a bioinformatics context, and smart widgets like Notifications are pretty, but to be honest, there aren’t any really compelling reasons to switch. I’ve not used this machine as a laptop much so far, so I can’t comment on the power usage (e.g. stuff like App Nap) yet, although it seems to be improved… a bit.. and I haven’t had time to run any BEAST benchmarks to see how the JVM implementation compares. But there is one massive benefit: this is an OS Apple are still supporting! This matters because stuff like security and firmware updates really do matter, a lot – and release cycles are getting ever shorter, especially as Macs get targeted more. In short: I couldn’t afford to stay behind any longer!

Update [5 Oct 2014]: Given the Shellshock bash exploit affects both 10.6 and 10.9, but Apple aren’t – as yet – releasing a patch for 10.6, while they rushed a 1.0 patch for 10.9 in less than a week, the security aspect of this upgrade is even more clearly important…

Update [23 Oct 2014]: Nope, I won’t be upgrading to Yosemite for a while, either!

Installing Ubuntu 12.04 LTS on a Dell Windows 8 machine with UEFI

Quick post this – originally on evolve.sbcs but behind a wall there so reposted here – both in case anyone’s interested and in case something breaks and we have to remember what we did, exactly (the far more likely scenario)

Ubuntu. Everyone loves it – probably the world’s easiest dual-boot install going and you get a lot of features out of the box as well. So, for the Windows user who needs Linux features / reliability / configurations but wants to keep a copy of Windows on their system for Office or just build testing, Ubuntu is a great solution.

Well, at least, it used to be. Then Microsoft went and released Windows 8. Quite apart from being an even worse resources hog than W7, it also introduced the horror that is UEFI, and a whole load of hell was unleashed. That’s because UEFI – a seemingly harmless set of ‘improvements’ to bootloader management (the bit of firmware on your motherboard which locates the OS, introduces it to RAM and disk controllers, and generally makes everything you see happen) – is actually a right pain in the arse. Typically for Microsoft, UEFI isn’t just overloaded with features which dilute its core purpose: it actually introduces unpredictable behaviour which makes life miserable for anyone nosing beneath the surface of the operating system (Windows) the machine comes with – not just those trying to dual boot, although I suspect they were the main targets, as millions of people trying Ubuntu for free and realising what an unstable rip-off Windows 8 is wouldn’t play well with shareholders…

Which is all a ranty way to say that the lovable Ubuntu install process I’ve used for pretty much a decade on Windows 7, XP, Vista and even 98 machines – all of which happily budge up a bit to make room for a dual-boot Ubuntu installation – has been well and truly borked by Windows 8 and their crappy UEFI lock-in. In particular, Hernani, one of our new students, has been issued a shiny new Dell (good for him) with Windows 8 loaded via UEFI (not so good). The install process was markedly more complicated than other Windows / BIOS versions, and while this is a common problem, I couldn’t find a single tutorial online able to help our specific problem – so here you go (PS: I / the lab / QMUL disclaim all liability, express or implied, should you attempt the steps below and suffer…):

  • Create an Ubuntu Live / bootable install USB
  • Shrink your Windows partition
  • Back up your files and make a Windows Restore CD / USB
  • Access the UEFI bootloader
  • Install Ubuntu
  • Re-activate UEFI
  • Check GRUB
  • Install boot-repair
  • Check GRUB
  • Keep backing up your files!

Create an Ubuntu Live / bootable install USB (or CD)

First we need to create a CD or a USB which we’ll use to install Ubuntu (and/or run a ‘live’ session, where you run the OS directly off the disk without installing to the hard drive. This lets you test Ubuntu and verify it works on your machine). This is a very well covered topic, not least from Ubuntu.com themselves, but I’ll just note that we installed Ubuntu 12.04 LTS (long-term stable), 64-bit release onto a USB – the machine in question having no optical drive. We used Unetbootin for this.

Shrink your Windows partition to create space on the hard drive for Ubuntu

To install Ubuntu we’ll need at least 100Gb free disk space. How much of the available space you want to give to Ubuntu is up to you, and probably depends on whether you plan to use your machine mainly in Windows 8 or Ubuntu; Ubuntu can see the Windows partitions, but not the other way round. To do this we need to shrink the existing Windows partition and repartition the existing space. This subject is well covered in this guide (which we started with) but the subsequent steps gave us more trouble (hence this post)…

Back up your files and make a Windows Restore CD / USB

There is a fairly high chance that this will result in a fried filesystem, and you may not even be able to get Windows to boot. You might also lose data if you screw up the disk partitions. So back up all your files. This is very important – not doing so practically guarantees the jinns of Murphy will screw up your hard drive for you. I don’t care how you do this, just do it.

Equally, since we might bork Windows well-and-truly (possibly a good thing but it would be nice to do so on purpose rather than accidentally) it makes sense to burn a Windows Recovery CD (or USB) at the very least. Again, you can do this easily and there’s lots of tutorials out there, so I won’t deal with it here, except to say you’ll need this to be on a separate USB/CD to the Ubuntu one we just made.

Access the UEFI layer and switch to legacy BIOS

The operating system itself (Windows, OSX, Linux etc) doesn’t actually run straight away when you start the computer. This is, in fact, why you can install / upgrade / change / dual-boot various different OSes. Instead, a small piece of software hardcoded onto your motherboard runs when you hit the power button to start the machine. This is the black screen you glimpse as the machine reboots. Most commonly this software (‘firmware’ in fact, since it is rarely changed or updated) is called BIOS, but a new standard called EFI has become more common in recent years (Windows’ own flavour is the hellish UEFI). Its main job is to find all the various device drivers (monitor, keyboard, hard disks, USB ports), introduce them to the CPU and RAM, and then hand the whole lot over to an operating system on a disk somewhere to make the good stuff happen. Note that I said ‘an‘ operating system, located somewhere on ‘a’ disc – there’s no reason why Windows should be picked, and why the disc should be the hard drive, and not… say… an Ubuntu install USB! :)

So, to access this magical BIOS-y, UEFI-y world where we can tinker about with the default OS, we need to shut the computer down and then restart it; in most personal computers until now, hitting a setup key (usually something like F2 or F12) repeatedly on restart would interrupt the boot process and drop you into a special setup screen. Unfortunately, Windows 8 has a ‘feature’ (read: ‘restriction’) built in called Secure Boot which normally prevents you from doing this insanely simple procedure. Instead there’s a vastly more complicated process, which is luckily well explained in this SO thread. Scroll down to ‘Windows 8 + Ubuntu‘ to do this.

Once you have rebooted into BIOS/UEFI, the version that shipped with this Dell (an OptiPlex 9020 ultra-compact) has a screen with an ‘advanced boot/UEFI options’ submenu and a ‘secure boot’ submenu. Enter the secure boot menu, disable secure boot and click ‘apply’ to save changes, then enter the UEFI screen. The most important option on this screen selects between bootloader modes: ‘UEFI’ and ‘legacy’ (which means BIOS). We’ll need the legacy / BIOS mode to install Ubuntu from USB, so select this. The previous set of UEFI options will be replaced by a set of devices. This is the main difference from our point of view – in BIOS mode the computer will try each device for a bootloader file, in the order in which they appear, and the first one found will be run. EFI, on the other hand, is able to look non-sequentially on a device (e.g. a disc) for a specific bootloader. However: we just want to get on with our Ubuntu installation, so make sure the device with your Ubuntu installer (USB or CD) is listed first, click ‘apply’ then ‘exit’ (making sure said media is actually inserted!)

Install Ubuntu

The process to install Ubuntu itself is actually simple enough, and covered in truly exhaustive (some might say ‘tedious’) detail elsewhere – but there’s a couple of installation points we need to note at this stage for our specific application. Again, if you’re doing a complete wipe of Windows you can just do a complete erase and you’ll be fine:

  • Select ‘something else’  in the main install options – do not erase Windows if you want to dual-boot.
  • There’s probably not much point in installing the updates and third-party stuff at this stage – I had to reinstall Ubuntu a few times by trial and error, and the extra packages take a while to download, unpack, and install. Anyway, you can get them later on when the installation’s debugged.
  • The order and location of your partitions matters, a lot. This process is covered in more detail in this SO thread, but to summarise (you should read the whole SO post though):
    • You’ll need 50Gb (50000Mb) for the Ubuntu OS itself, formatted as ext4 and mounted as root ‘/’. On our Dell this is /dev/sda7
    • You’ll need 16Gb (15999Mb) for swap space (Windows users will know this as ‘virtual RAM’). This is technically optional, but running without it is very likely to cause frequent crashes from RAM shortages.
    • Don’t touch the existing (Windows) partitions
    • You’ll want to partition the remaining disk space as ext4 (probably) and mount it at ‘/home’
  • Make a note of the password you select for Ubuntu! Otherwise you’ll have to reinstall…
  • Finally – and this is important – although other guides say the bootloader should be written to:
    /dev/sda

    we found this didn’t work with our Dell UEFI. Instead we had to install the bootloader to:

    /dev/sda1

    which worked fine.

Re-activate UEFI

The install process complete, we now need to switch the UEFI back on; this will continue to be the main way you call the bootloader from now on, but hopefully we’ll be using GRUB (a linux bootloader) instead of the Windows bootloader. GRUB allows you to pick which OS to boot every time you restart the computer (it defaults to Ubuntu, but you can choose Windows 8 if you want, it won’t care!) and should be configured automatically. However, we’ll need to turn UEFI back on first: shut the computer down. Wait 5-10 seconds. Now restart it, hitting F2 (or other BIOS / UEFI setup menu hotkey) as soon as you hit the power button to access the UEFI / BIOS screen again.

From the BIOS / UEFI, find the advanced boot options, deselect Legacy mode (BIOS) and reselect UEFI. This will put us in UEFI mode again, so we have to configure the UEFI options to choose the GRUB bootloader. You should see that an ‘ubuntu’ option has appeared in the UEFI list as well as the Windows boot manager. Select this, then click ‘view’ to see which bootloader file it points to.

In our installation, ‘ubuntu’ pointed to (filesystem FS0): /EFI/boot/shimx64.efi. This will simply load ubuntu directly. If you only plan to use Ubuntu, you can stop here, as this option will find Ubuntu every time you reboot. However, we wanted to use GRUB to pick Ubuntu or Windows, so there’s an additional step here: Click ‘add boot option’ (or ‘path’, or similar) to create a new UEFI boot option. Give it a name (we went for ‘GRUB’, logically enough). Then we need to pick the bootloader file itself – this is the file the UEFI will try and find, which will then handle the OS selection and initialisation. In our case, the file we were looking for was (again on filesystem ‘FS0’): /EFI/boot/grubx64.efi. This is the file to load GRUB, but while you’re poking about in here you may also see a directory for Windows and the bootloaders (there’s a few) associated with it. Select this file, save the UEFI option and make sure this is the first boot option in the list (use arrows to move it up) and/or uncheck (don’t delete…) the other options (‘Windows boot manager’ and ‘ubuntu’ probably).

Click ‘apply’ then ‘exit’ and the machine will reboot…

 

Check GRUB

If we were installing next to Windows 7, Vista, XP, NT, 2000, 98 – or almost any other OS – we’d be able to have a cup of tea and a celebratory hobnob at this point, as after rebooting the machine the friendly GRUB boot selection screen would pop up and let us select Ubuntu, Ubuntu recovery mode, memtest or indeed Windows with a simple dialog. We found that GRUB loaded OK, but although all the linux-y options came up, it couldn’t show the Windows boot option. Damn you, Windows 8 – in W7 this is no problem at all. We also wouldn’t have had to prat about with UEFI settings, either – but hey, Windows 8 is shiny and looks a bit like iOS, so it must be better, right…

A fair few people have actually managed to get GRUB to function correctly at this stage on W8, but then they’re not reading this blog, are they? 😉

Install boot-repair

If the GRUB bootloader works (that is, you see the GRUB selection screen) but the Windows installation you irrationally cherish so much isn’t shown, you’ll need to edit the GRUB config files to include the location of the Windows bootloader files so that GRUB can display them as an option for you. You can sleuth about the hard drive to find both the GRUB config and Windows .efi files but luckily there’s a handy script from YannUbuntu called boot repair which does this for you. I followed instructions from the SO thread we used before (here, section ‘REPAIRING THE BOOT’). In their words, “In Ubuntu after it boots, open a Terminal and type the following:


sudo add-apt-repository ppa:yannubuntu/boot-repair  
sudo apt-get update
sudo apt-get install boot-repair

“Now run

boot-repair"

That will bring up the boot-repair script (even has a handy GUI). This is pretty easy to use; the ‘recommended repair’ will probably fix things for you assuming you’ve set everything else up OK.

Check GRUB again

Reboot. Pray. Sorry, but ours was working after this step, so you’re on your own if you haven’t had much luck by now. The first thing I would do if this failed would be to check the BIOS was set to UEFI, and that the GRUB option you edited with boot-repair was active (and first) in the list.

Keep backing up your files!

The forums seem to suggest there’s a chance your system will randomly fall over in the next few weeks. So while you should back up your files in general anyway – especially if you’re working with RCUK data – it pays to be extra-vigilant for the next month.

 

That’s all.. good luck!