Owen on software

Java ain't so slow

15 September 2019

I’m going to talk about Java, performance, and a proof-of-concept I created back in 2016 that provides a generalised brute-force search engine that can perform billions of comparisons in a fraction of a second.

Another title might have been ‘When brute force beats fancy algorithms’.

Define ‘fast’

We often talk in a very binary fashion about programming language performance. C is fast. Python is slow. But it’s more nuanced than that. Context matters.

Actually, interpreted languages like Python have proved totally adequate (read that as ‘fast enough that you wouldn’t notice a difference’) for most web applications. That’s primarily because web applications tend to be I/O bound: they spend their time talking over a network and, more importantly, to a database/store, and those paths dominate all other factors.

Java is often talked about as being a slow language. But what do we mean when we say that? Is it slow in all contexts? The short answer is no, which is why it’s currently the industry’s dominant ‘80%’ language (the workhorse for the mainstream of programmers) and foundational in the Hadoop ecosystem.

In fact, Java has great speed in a tight loop. At the other end of the spectrum, its GC model can have a significant impact on latency during compaction pauses - and in that context it definitely is ‘slow’.

Doing the impossible with Java

Roll the clock back to 2011, and the LMAX Disruptor arrived on the Java scene. This was fairly unprecedented at the time, as it sat at the core of the LMAX trading platform - and trading platforms were the domain of ‘fast’ programming languages, not the likes of Java.

Martin Thompson and his LMAX colleagues had rethought what it meant for a computation to be fast, and had observed that software could be written to take advantage of some of the fundamental characteristics of the underlying hardware. It turned out that even Java could be fast.

The term ‘Mechanical Sympathy’ was coined. Martin has an awesome blog covering mechanical sympathy from a number of different perspectives. You should read it. I did. It changed the way I thought about software performance, and it’s a must read for anyone in the data-engineering space.

Where I began with Lefty

My interest was immediately piqued by the LMAX work, and I read and reread Martin’s mechanical sympathy blog.

A few years later, at work we were performing similarity search over millions of hashes. We had a solution that was more than fast enough. But I had LMAX and mechanical sympathy in the back of my mind, and as I looked at people’s attempts to address this algorithmically I couldn’t help wondering whether brute force in combination with mechanical sympathy would produce better results, especially as you moved onto billion-scale datasets.

So Lefty, named for being a left-of-field idea (for me at least), was born.

Lefty started with a for loop over a relatively small amount of data, which I then attempted to generalise over many iterations into something that was more plug and play.
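
That first loop looked something like the sketch below - a minimal reconstruction assuming 64-bit hashes, with names of my own choosing rather than anything from Lefty itself:

    // Brute-force nearest-neighbour scan over 64-bit hashes.
    static int bestMatch(long[] hashes, long query) {
        int bestIndex = -1;
        int bestDistance = Integer.MAX_VALUE;
        for (int i = 0; i < hashes.length; i++) {
            // Hamming distance: XOR the two hashes, count the bits that differ.
            int distance = Long.bitCount(hashes[i] ^ query);
            if (distance < bestDistance) {
                bestDistance = distance;
                bestIndex = i;
            }
        }
        return bestIndex;
    }

(Long.bitCount is typically compiled down to a single POPCNT instruction by HotSpot, which is part of why the comparison is so cheap.)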

Where it ended

It turns out that you can do more than a billion Hamming distance comparisons in a fraction of a second on your laptop. This was at least an order of magnitude faster than the algorithmic approaches folk were using. Who knew.

The nice thing about Lefty was that it scaled vertically in a linear fashion, and with a bit more work could potentially have done the same horizontally. So on a large AWS instance you could perform a Hamming distance search over a billion hashes in less than 100 ms.
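
As a rough sanity check on that claim: a billion 64-bit hashes is about 8 GB of data, so scanning them in under 100 ms means sustaining roughly 80 GB/s of aggregate read bandwidth - plausible on a big multi-core box once every core is streaming its own slice of memory sequentially.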

So how does Lefty achieve this?

Broadly, Lefty achieves this by being really dumb: it runs a comparison over every single piece of data that you have. So how does that ever work?

Well, it turns out that if you store a compact representation of your data, computers are very, very fast at accessing it … provided you do so in a serial and predictable manner. As long as your comparison operation is not too computationally complex, you get great speed.
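
To put numbers on ‘compact’: a 64-bit hash is just 8 bytes, so even a billion of them pack into a single primitive array (the sizing here is hypothetical, to illustrate the point):

    // One primitive long per 64-bit hash: no object headers, no references,
    // no pointer chasing. 1_000_000_000 * 8 bytes is about 8 GB, contiguous.
    // (Illustrative sizing - you'd need a suitably large heap, e.g. -Xmx10g.)
    long[] hashes = new long[1_000_000_000];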

Lefty stores the data in what is effectively just a giant array in a slab of memory. It assigns a region of that slab to each core, which then reads it left-to-right. Modern CPUs have a hierarchy of caches that keep data local to each core, and if you read memory in a predictable pattern like this, the hardware will prefetch data into those caches ahead of you. That makes your data access a zillion times faster than going to main memory. (Note, that’s a pretty rough high-level explanation - you should really read the mechanical sympathy blog for the low-down.)
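
To make that concrete, here’s a minimal sketch of the region-per-core idea using a plain thread pool. This is my reconstruction for illustration, not Lefty’s actual implementation - the names and the pool-based threading are assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Illustrative sketch: one contiguous region of the hash array per core,
    // each scanned strictly left-to-right so the prefetcher can stream it
    // through the cache hierarchy.
    static int nearest(long[] hashes, long query, int cores) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        int chunk = (hashes.length + cores - 1) / cores;
        List<Future<int[]>> results = new ArrayList<>();
        for (int c = 0; c < cores; c++) {
            final int from = c * chunk;
            final int to = Math.min(from + chunk, hashes.length);
            results.add(pool.submit(() -> {
                int bestIndex = -1;
                int bestDistance = Integer.MAX_VALUE;
                // Sequential, predictable access within this core's region.
                for (int i = from; i < to; i++) {
                    int d = Long.bitCount(hashes[i] ^ query);
                    if (d < bestDistance) {
                        bestDistance = d;
                        bestIndex = i;
                    }
                }
                return new int[] { bestIndex, bestDistance };
            }));
        }
        int bestIndex = -1;
        int bestDistance = Integer.MAX_VALUE;
        // Reduce the per-core winners down to a single global best match.
        for (Future<int[]> f : results) {
            int[] r = f.get();
            if (r[1] < bestDistance) {
                bestDistance = r[1];
                bestIndex = r[0];
            }
        }
        pool.shutdown();
        return bestIndex;
    }

A production version would worry about thread affinity and false sharing at region boundaries, but the shape - partition, stream, reduce - is the whole trick.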

How well has Lefty aged?

Whilst Lefty had some utility at the time it was created, it has probably not aged so well in the three years since.

GPUs have taken a huge leap forward in the last few years, in terms of the amount of memory available and the pricing in the cloud. Given that modern GPUs have thousands of cores, today you probably want to head in that direction for a solution in this space.

Still, Lefty goes to show that when you think outside the box you can get surprising performance out of ‘slow’ languages on a single node ;)

Beyond mechanical sympathy

Like I said before, Martin Thompson’s blog feels like it should be required reading for all data engineers.

Beyond that I’d also highly recommend taking a look at Frank McSherry’s blog posts, where he explores similar ideas, often around data layout. His Hilbert Curve work is super interesting - see ‘Scalability! But at what COST?’.

Tags: Java Performance Data-Engineering

