We were having a chat over coffee today and a question arose about merging data from multiple databases. At first sight this seems pretty easy, especially if you’re working with relational databases that have unique IDs (like, uh, a Latin binomial name – Homo sapiens) to hang from… right?
But, oh no... not at all. One important reason is that seemingly similar data fields can be extremely tricky to merge. They may have been stated with differing precision (0.01, 0.0101, or 0.01010199999?), be encoded in different data types (text, float, blob, hex etc.) or character-set encodings (UTF-8 or Korean?) and, even after all that, refer to subtly different quantities (mass vs weight, perhaps). Who knew database ninjas actually earnt all that pay?
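To make the precision problem concrete, here is a minimal sketch of one possible matching policy: treat two text-encoded values as the same if they agree once both are rounded to the coarser of their two stated precisions. The class and method names (`PrecisionAwareCompare`, `agreeAtSharedPrecision`) are made up for illustration, and the choice of rounding mode is an assumption; whether this policy is appropriate at all depends on what the numbers actually mean.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper class, for illustration only.
class PrecisionAwareCompare {

    /**
     * Returns true if the two decimal strings agree once both are rounded
     * to the smaller (coarser) of their two scales.
     * Assumption: HALF_EVEN rounding is acceptable for the data in question.
     */
    static boolean agreeAtSharedPrecision(String a, String b) {
        // BigDecimal keeps the stated number of decimal places (the "scale").
        BigDecimal x = new BigDecimal(a.trim());
        BigDecimal y = new BigDecimal(b.trim());
        int sharedScale = Math.min(x.scale(), y.scale());
        BigDecimal xRounded = x.setScale(sharedScale, RoundingMode.HALF_EVEN);
        BigDecimal yRounded = y.setScale(sharedScale, RoundingMode.HALF_EVEN);
        return xRounded.compareTo(yRounded) == 0;
    }

    public static void main(String[] args) {
        System.out.println(agreeAtSharedPrecision("0.01", "0.0101"));          // true
        System.out.println(agreeAtSharedPrecision("0.0101", "0.01010199999")); // true
        System.out.println(agreeAtSharedPrecision("0.01", "0.02"));            // false
    }
}
```

Note that this only addresses precision; it does nothing about data types, character sets or the mass-vs-weight kind of mismatch.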
So it was surprising, but understandable, to learn that a major private big-data user (unnamed here) stores pretty much everything as text strings. Of course this solves one set of problems nicely (everyone knows how to parse/handle text, surely?) but creates another. That’s because it is trivially easy to encode the same real-valued number as several different text strings – some of which may break sort algorithms, or even blow memory limits. Consider the number ‘0.01’: as written there’s very little ambiguity for you and me. But what about:
“0.01”,
“00.01”,
“ 0.01” (note the space),
or even “0.01000000000”?
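As a rough illustration of how these variants can upset a sort, the sketch below orders the same strings twice: once as plain text and once by their parsed value. The class and method names (`TextVersusNumericSort`, `sortAsText`, `sortAsNumbers`) are invented for this example, and the extra values "10" and "9" are added only to make the difference obvious.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class, for illustration only.
class TextVersusNumericSort {

    /** Sorts a copy of the input using plain String ordering. */
    static List<String> sortAsText(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    /** Sorts a copy of the input by the float value each string parses to. */
    static List<String> sortAsNumbers(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        // The sort is stable, so strings that parse to equal values keep their input order.
        copy.sort(Comparator.comparingDouble((String s) -> Float.parseFloat(s)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("10", "9", "0.01", " 0.01", "00.01", "0.01000000000");
        // As text: the space-prefixed string comes first and "10" sorts before "9".
        System.out.println(sortAsText(values));
        // As numbers: the four spellings of 0.01 stay together, then "9", then "10".
        System.out.println(sortAsNumbers(values));
    }
}
```

Sorted as text, the four spellings of 0.01 are treated as four different keys and "10" lands ahead of "9"; sorted by parsed value, they collapse to one value and the order is the one you and I would expect.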
After a quick straw poll, we also realised that, although we all relied on the native string-to-float methods in our most-used programming languages (Java for me; Perl, Python etc. for others) to perform the appropriate conversion, none of us had actually checked. We knew how we thought they worked, and how we hoped they would, but it’s always worth checking. Time to write some quick code – here it is, on GitHub.
And in code:
    package uk.ac.qmul.sbcs.evolution.sandbox;

    /**
     * Class to test the behaviour of the Float.parseFloat() method on text data.
     * In particular odd strings which should be equal, e.g.
     * <ul>
     * <li>"0.01"</li>
     * <li>"00.01"</li>
     * <li>" 0.01" (note space)</li>
     * <li>"0.0100"</li>
     * </ul>
     * NB uses assertions to test - run JVM with '-ea' argument. The first three tests
     * should pass in the orthodox manner. The fourth should throw assertion errors to pass.
     * @author joeparker
     */
    public class TextToFloatParsingTest {

        /**
         * Default no-arg constructor
         */
        public TextToFloatParsingTest() {
            /* Set up the floats as strings */
            String[] floatsToConvert = {"0.01", "00.01", " 0.01", "0.0100"};
            Float[] floatObjects = new Float[4];
            float[] floatPrimitives = new float[4];

            /* Convert the strings, first to Float objects and then unbox to float primitives */
            for (int i = 0; i < 4; i++) {
                floatObjects[i] = Float.parseFloat(floatsToConvert[i]);
                floatPrimitives[i] = floatObjects[i];
            }

            /* Are they all equal? They should be: test this. Should PASS */
            /* Iterate through the triangle (each pair of distinct indices once) */
            System.out.println("Testing conversions: test 1/4 (should pass)...");
            for (int i = 0; i < 4; i++) {
                for (int j = i + 1; j < 4; j++) {
                    assert (floatPrimitives[i] == floatPrimitives[j]);
                    assert (floatObjects[i] == floatPrimitives[j]);
                }
            }
            System.out.println("Test 1/4 passed OK");

            /* Test the numerical equivalent. Should PASS */
            System.out.println("Testing conversions: test 2/4 (should pass)...");
            for (int i = 0; i < 4; i++) {
                assert (floatPrimitives[i] == 0.01f);
            }
            System.out.println("Test 2/4 passed OK");

            /* Test the numerical equivalent inequality. Should PASS */
            System.out.println("Testing conversions: test 3/4 (should pass)...");
            for (int i = 0; i < 4; i++) {
                assert (floatPrimitives[i] != 0.02f);
            }
            System.out.println("Test 3/4 passed OK");

            /* Test the inversion */
            /* These assertions should FAIL */
            System.out.println("Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...");
            boolean test_4_pass_flag = false;
            try {
                for (int i = 0; i < 4; i++) {
                    for (int j = i + 1; j < 4; j++) {
                        assert (floatPrimitives[i] != floatPrimitives[j]);
                        assert (floatObjects[i] != floatPrimitives[j]);
                        // If AssertionErrors are thrown as we expect they will be, this is never reached.
                        test_4_pass_flag = true;
                    }
                }
            } finally {
                // test_4_pass_flag should never be set true (in the loop above) if AssertionErrors
                // have been thrown correctly. There is no catch block, so the AssertionError still
                // propagates (and prints a stack trace) once this finally block has run.
                if (test_4_pass_flag) {
                    System.err.println("Test 4/4 passed! This constitutes a logical FAILURE");
                } else {
                    System.out.println("Test 4/4 passed OK (expected assertion errors occured as planned.");
                }
            }
        }

        public static void main(String[] args) {
            new TextToFloatParsingTest();
        }
    }
If you run this with assertions enabled (‘/usr/bin/java -ea uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest’) you should get something like:
Testing conversions: test 1/4 (should pass)...
Test 1/4 passed OK
Testing conversions: test 2/4 (should pass)...
Test 2/4 passed OK
Testing conversions: test 3/4 (should pass)...
Test 3/4 passed OK
Testing conversions: test 4/4 (should fail with java.lang.AssertionError)...
Exception in thread "main" java.lang.AssertionError
at uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest.<init>(TextToFloatParsingTest.java:60)
at uk.ac.qmul.sbcs.evolution.sandbox.TextToFloatParsingTest.main(TextToFloatParsingTest.java:76)
Test 4/4 passed OK (expected assertion errors occured as planned.
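If the values really must stay as text, one option (a sketch only, not something the unnamed big-data user is known to do) is to rewrite each string into a single canonical spelling before storing or comparing it. The example below assumes plain decimal input and uses `java.math.BigDecimal`; the class and method names (`CanonicalDecimalText`, `canonical`) are hypothetical.

```java
import java.math.BigDecimal;

// Hypothetical helper class, for illustration only.
class CanonicalDecimalText {

    /**
     * Returns one canonical spelling for a decimal number held as text:
     * surrounding whitespace, leading zeros and trailing zeros are dropped.
     * Assumption: the input is a plain decimal string; anything else makes
     * the BigDecimal constructor throw NumberFormatException.
     */
    static String canonical(String raw) {
        // Unlike Float.parseFloat, the BigDecimal constructor rejects
        // surrounding whitespace, so trim explicitly first.
        BigDecimal value = new BigDecimal(raw.trim());
        // toPlainString avoids scientific notation such as "1E+2".
        return value.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        String[] variants = {"0.01", "00.01", " 0.01", "0.01000000000"};
        for (String v : variants) {
            // Each line prints the original in quotes, then 0.01
            System.out.println("\"" + v + "\" -> " + canonical(v));
        }
    }
}
```

The trade-off is that trailing zeros sometimes carry meaning (stated precision), so canonicalising like this throws that information away.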