BaTS (and Befi-BaTS), SHiAT, and Genome Convergence Pipeline have moved!

Important – please take note!
Headline:

  • All my phylogenetics software is now on GitHub, not websites or Google Code
  • Please use the new FAQ pages and issue/bug tracker forms, rather than emailing me directly in the first instance

Until now, I’ve been hosting the open-sourced parts of my phylogenetics software on code.google.com. These include the BaTS (and Befi-BaTS) tools for phylogeny-trait association correlations; the alignment profilers SHiAT (and Genious Entropy plugin), and the Genome Convergence API for the Genome Convergence Pipeline and Phylogenomics Dataset Browser. However, Google announced that they are ending support for Google Code, and from August all projects will be read-only.

I’ve therefore migrated all my projects to GithubThis will eventually include FAQs, forums and issue/bug tracking for the most popular software, BaTS and Genome Convergence API.

The projects can now be found at:

 

I am also changing how I respond to questions and bug requests. In the past I dealt with questions as they came in, with the odd explanatory post and a manual or readme with each release. Predictably, this meant I spent a lot of time dealing with duplicates or missing bugs or feature requests. I am now in the process of compiling a list of FAQs for each project, as well as uploading the manuals in markdown format so that I can update them with each release. Please bear with me as I go through this process. In the meantime, if you have an issue with a piece of software or think you have found a bug, please:

  1. Make sure you have the most recent version of the software. In most cases this will be available as an executable .jarfile on the project github page.
  2. Check the ‘Issues’ tab on the project github page. Your issue may be a duplicate, or already fixed by a new release. If your bug isn’t listed, please open a new issue giving as much detail as possible.
  3. Check the manual and FAQs to see if anyone else has had the same problem – I may well have answered their question already.
  4. If you still need an answer please email me on joe+bioinformaticshelp@kitserve.org.uk

Thanks so much for your support and involvement,

Joe

Embedding Artist profiles, playlists, and content from Spotify in HTML

Quick post this – turns out Spotify have added a really cool new function to their desktop application: You can now right-click any resource in Spotify (could be an artist, a playlist, a profile or a track or album) and get a link to the HTML code you need to embed it into another webpage. The link looks like this:

Untitled 2

The HTML is then copied to your clipboard, ready to drop into an artist webpage. Pretty cool eh? Let’s give it a spin:

1
<iframe src="https://embed.spotify.com/?uri=spotify%3Aartist%3A4qsWY8X6Yq3TTVe4gn6cnL" height="300" width="300" frameborder="0"></iframe>


Parsing numbers from multiple formats in Java

We were having a chat over coffee today and a question arose about merging data from multiple databases. At first sight this seems pretty easy, especially if you’re working with relational databases that have unique IDs (like, uh, a Latin binomial name – Homo sapiens) to hang from… right?

But, oh no.. not at all. One important reason is that seemingly similar data fields can be extremely tricky to merge. They may have been stated with differing precision (0.01, 0.0101, or 0.01010199999?), be encoded in different data types (text, float, blob, hex etc) or character set encodings (UTF-8 or Korean?) and even after all that, refer to subtly different quantities (mass vs weight perhaps). Who knew database ninjas actually earnt all that pay.

So it was surprising, but understandable, to learn that a major private big-data user (unnamed here) stores pretty much everything as text strings. Of course this solves one set of problems nicely (everyone knows how to parse/handle text, surely?) but creates another. That’s because it is trivially easy to code the same real-valued number in multiple different text strings – some of which may break sort algorithms, or even memory constraints. Consider the number ‘0.01’: as written there’s very little ambiguity for you and me. But what about:

“0.01”,
“00.01”,
” 0.01″ (note the space),
or even “0.01000000000”?

After a quick straw poll, we also realised that, although we knew how most of our most-used programming languages (Java for me, Perl, Python etc for others) performed the appropriate conversion in their native string-to-float methods. We knew how we thought they worked, and how we hoped they would, but it’s always worth checking. Time to write some quick code – here it is, on GitHub

And in code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
package uk.ac.qmul.sbcs.evolution.sandbox;

/**
* Class to test the Float.parseFloat() method performance on text data
*
In particular odd strings which should be equal, e.g.
*
<ul>
    <li>"0.01"</li>
    <li>"00.01"</li>
    <li>" 0.01" (note space)</li>
    <li>"0.0100"</li>
</ul>
*

NB uses assertions to test - run JVM with '-ea' argument. The first three tests should pass in the orthodox manner. The fourth should throw assertion errors to pass.
* @author joeparker
*
*/

public class TextToFloatParsingTest {

/**
* Default no-arg constructor
*/

public TextToFloatParsingTest(){
/* Set up the floats as strings*/
String[] floatsToConvert = {"0.01","00.01"," 0.01","0.0100"};
Float[] floatObjects = new Float[4];
float[] floatPrimitives = new float[4];

/* Convert the floats, first to Float objects and also cast to float primitives */
for(int i=0;i&lt;4;i++){
floatObjects[i] = Float.parseFloat(floatsToConvert[i]);
floatPrimitives[i] = floatObjects[i];
}

/* Are they all equal? They should be: test this. Should PASS */
/* Iterate through the triangle */
System.out.println("Testing conversions: test 1/4 (should pass)...");
for(int i=0;i&lt;4;i++){
for(int j=1;j&lt;4;j++){
assert(floatPrimitives[i] == floatPrimitives[j]);
assert(floatObjects[i] == floatPrimitives[j]);
}
}
System.out.println("Test 1/4 passed OK");

/* Test the numerical equivalent */
System.out.println("Testing conversions: test 2/4 (should pass)...");
for(int i=0;i&lt;4;i++){
assert(floatPrimitives[i] == 0.01f);
}
System.out.println("Test 2/4 passed OK");

/* Test the numerical equivalent inequality. Should PASS */
System.out.println("Testing conversions: test 3/4 (should pass)...");
for(int i=0;i&lt;4;i++){
assert(floatPrimitives[i] != 0.02f);
}
System.out.println("Test 3/4 passed OK");

/* Test the inversion */
/* These assertions should FAIL*/
System.out.println("Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...");
boolean test_4_pass_flag = false;
try{
for(int i=0;i&lt;4;i++){
for(int j=1;j&lt;4;j++){
assert(floatPrimitives[i] != floatPrimitives[j]);
assert(floatObjects[i] != floatPrimitives[j]);
test_4_pass_flag = true; // If AssertionErrors are thrown as we expect they will be, this is never reached.
}
}
}finally{
// test_4_pass_flag should never be set true (line 62) if AssertionErrors have been thrown correctly.
if(test_4_pass_flag){
System.err.println("Test 3/4 passed! This constitutes a logical FAILURE");
}else{
System.out.println("Test 4/4 passed OK (expected assertion errors occured as planned.");
}
}
}
public static void main(String[] args) {
// TODO Auto-generated method stub
new TextToFloatParsingTest();
}

}


If you run this with assertions enabled (‘/usr/bin/java -ea package uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest’) you should get something like:

Testing conversions: test 1/4 (should pass)...
Test 1/4 passed OK
Testing conversions: test 2/4 (should pass)...
Test 2/4 passed OK
Testing conversions: test 3/4 (should pass)...
Test 3/4 passed OK
Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...
Exception in thread "main" java.lang.AssertionError
    at uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest.<init>(TextToFloatParsingTest.java:60)
    at uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest.main(TextToFloatParsingTest.java:76)
Test 4/4 passed OK (expected assertion errors occured as planned.

Application note: ‘Befi-BaTS’ version 0.10.1 – Error rate and statistical power of distance-based measures of phylogeny-trait association.

In prep.

SUMMARY

Building on work presented previously (Parker et al., 2008), we study a number of more complex measures of phylogeny-trait association (implemented in the program Befi-BaTS / BaTS v0.10.1) which take into account the branch lengths of a phylogenetic tree in addition to the topographical relationship between taxa. Extensive simulation is performed to measure the Type II error rate (statistical power) of these statistics including those introduced in Parker et al. (2008), as well as the relationship between power and tree shape. The technique is applied to an empirical hepatitis C virus data set presented by Sobesky et al. (2007); their original conclusion that compartmentalization exists between viruses sampled from tumorous and non-tumorous cirrhotic nodules and the plasma is upheld. The association index (AI), migration (PS), phylodynamic diversity (PD) and unique fraction (UF) statistics offer the best combination of Type I error and statistical power to investigate phylogeny-trait association in RNA virus data, while the maximum monophyletic clade size (MC) and nearest taxon (NT) statistics suffer from reduced power in some regions of tree space.

Keywords: BaTS, hepatitis C virus, Markov-chain Monte Carlo, Phylogeny-trait association, Phylogenetic uncertainty, simulation.

Manuscripts in progress (all rights reserved – you may not copy or distribute these files; content and conclusions subject to change; strictly embargoed until publication in a peer-reviewed journal/book):

  • v1: (): .doc
  • v2 (01/01/2014): .docx
  • v3 (16/06/2017): .pdf
  • View this project on GitHub