I’ve been fooling around with some natural language data from OPUS, the “open parallel corpus.” This contains many gigabytes of movie subtitles, UN documents and other text, much of it tagged by part-of-speech and aligned across multiple languages. In total, there’s over 50 GB of data, compressed.

“50 GB, compressed” is an awkward quantity of data.

Let’s look at various ways to tackle this.

Last year, I had pretty good luck parsing and manipulating this data with Rust. I could zip through a 2 GB slice of the raw OPUS corpus fast enough to not get bored, and I had text I/O running at about 250 MB/s. But tools like Spark, Trill and Naiad can at least occasionally do better; I’ve heard of one of them processing over 1 GB/s.

My earlier Rust code faced two big limits:

  1. It was working with unparsed data.
  2. Even if a given query only needed to touch two columns, it still had to read all the columns into memory.

The obvious solution here is a column store, which would store each column in a separate file and read in only the columns required for a given query. The columns could be pre-processed for maximum efficiency, and the data could be compressed using Snappy, which decompresses at around 500 MB/s. This is one of the techniques used by Google’s Dremel and BigQuery systems, which can query trillion-row tables in seconds.
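To make the one-file-per-column idea concrete, here’s a minimal sketch in plain Rust. The helper names (`write_column`, `read_column`) and the temp-file paths are my own inventions for illustration, not part of any real column-store library, and there’s no compression:

```rust
use std::convert::TryInto;
use std::fs::File;
use std::io::{Read, Write};
use std::path::Path;

/// Write one column of i64 values to its own file, as raw
/// little-endian bytes. (Hypothetical helper for illustration.)
fn write_column(path: &Path, data: &[i64]) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    for value in data {
        f.write_all(&value.to_le_bytes())?;
    }
    Ok(())
}

/// Read one column back. A query that needs only this column
/// never has to open any other column's file.
fn read_column(path: &Path) -> std::io::Result<Vec<i64>> {
    let mut bytes = Vec::new();
    File::open(path)?.read_to_end(&mut bytes)?;
    Ok(bytes
        .chunks_exact(8)
        .map(|c| i64::from_le_bytes(c.try_into().unwrap()))
        .collect())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    // A tiny 3-row table with two columns, stored as separate files.
    write_column(&dir.join("col_a.bin"), &[1, 2, 3])?;
    write_column(&dir.join("col_b.bin"), &[10, 20, 30])?;

    // Summing `col_a` touches only `col_a.bin` on disk.
    let col_a = read_column(&dir.join("col_a.bin"))?;
    let total: i64 = col_a.iter().sum();
    println!("{}", total);
    Ok(())
}
```

The point is simply that the I/O cost of a query scales with the columns it names, not with the width of the table.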

Getting ready to benchmark in Rust

To test out this approach, I’m writing benchmarks in Rust. First, I need the Bencher type and black_box function from the test crate, which isn’t officially stabilized yet, so I use the nightly build of Rust and #![feature(test)]:

#![feature(test)]

extern crate test;

use std::mem::size_of;
use test::{Bencher, black_box};

Next, let’s pick out a suitable table size. Since we’re starting out with RAM-based tables, let’s go for a million rows and 200 columns:

const ROWS: usize = 1_000_000;
const COLUMNS: usize = 200;

This is large enough that I don’t want to zero it out using ordinary, safe Rust code. Instead, let’s write an unsafe function that allocates a vector of type T containing uninitialized garbage:

/// Create a Vec full of garbage data for benchmarking.  Only safe for
/// fairly primitive element types.
unsafe fn garbage_vec<T>(sz: usize) -> Vec<T> {
    let mut v = Vec::with_capacity(sz);
    // Extend `v` to include uninitialized memory.
    v.set_len(sz);
    // eddyb on #rust says that accessing uninitialized memory is
    // undefined behavior, but that we can use black_box to make the
    // optimizer be sufficiently cautious.
    black_box(&mut v);
    v
}
Yes, this is a hack. In a real program, we’d be working with memory from the OS.

Simulating a row store

And now we’re ready to try some performance tests. First, let’s allocate a 1,000,000-element buffer where each row contains 200 i64 columns, and then sum up 4 of those columns:

#[bench]
fn bench_row_store(b: &mut Bencher) {
    // Store all our data in one giant vector with wide rows.
    let data: Vec<[i64; COLUMNS]> = unsafe { garbage_vec(ROWS) };

    // Tell the benchmarker how much memory we touch per iteration.
    b.bytes = (ROWS * COLUMNS * size_of::<i64>()) as u64;

    // Our actual benchmark.
    b.iter(|| {
        let mut sum = 0;
        for row in data.iter() {
            sum += row[0] + row[50] + row[100] + row[150];
        }
        // Keep the optimizer from discarding `sum`.
        black_box(sum);
    });
}
The iter method runs our closure many times and averages the results.
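Conceptually, iter is just a timing loop. A much-simplified sketch of the idea (the real Bencher picks the iteration count adaptively and reports variance; this stand-in of mine uses a fixed count) might look like:

```rust
use std::time::Instant;

/// A simplified stand-in for Bencher::iter: run the closure many
/// times and report the mean wall-clock time per iteration, in
/// nanoseconds. (Hypothetical helper, not the real test harness.)
fn time_iterations<F: FnMut()>(mut f: F, iterations: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iterations {
        f();
    }
    start.elapsed().as_nanos() as f64 / iterations as f64
}

fn main() {
    let data: Vec<i64> = (0..1_000).collect();
    let ns = time_iterations(|| {
        let sum: i64 = data.iter().sum();
        assert!(sum >= 0);
    }, 1_000);
    println!("~{:.0} ns per iteration", ns);
}
```

Averaging over many iterations is what makes the sub-millisecond timings below meaningful at all; a single run would be dominated by noise.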

Simulating a column store

Now let’s try allocating memory for each column individually. We only bother to simulate the columns we care about for this example:

#[bench]
fn bench_column_store(b: &mut Bencher) {
    // Store each column separately.
    let col0: Vec<i64> = unsafe { garbage_vec(ROWS) };
    let col50: Vec<i64> = unsafe { garbage_vec(ROWS) };
    let col100: Vec<i64> = unsafe { garbage_vec(ROWS) };
    let col150: Vec<i64> = unsafe { garbage_vec(ROWS) };

    b.bytes = (ROWS * 4 * size_of::<i64>()) as u64;
    b.iter(|| {
        let mut sum = 0;
        for i in 0..ROWS {
            sum += unsafe {
                col0.get_unchecked(i) +
                col50.get_unchecked(i) +
                col100.get_unchecked(i) +
                col150.get_unchecked(i)
            };
        }
        black_box(sum);
    });
}
We use get_unchecked to bypass Rust’s bounds checks.
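As a quick illustration of what we’re bypassing: the safe accessors either panic or return None on an out-of-range index, while get_unchecked simply trusts the caller, which is why it requires an unsafe block:

```rust
fn main() {
    let v = vec![10i64, 20, 30];

    // Safe indexing: checked at runtime, panics if out of range.
    assert_eq!(v[1], 20);

    // `get` makes the bounds check explicit by returning an Option.
    assert_eq!(v.get(1), Some(&20));
    assert_eq!(v.get(99), None);

    // `get_unchecked` skips the check entirely. This is fine here
    // because 1 < v.len(); with a bad index it would be undefined
    // behavior.
    let x = unsafe { *v.get_unchecked(1) };
    assert_eq!(x, 20);
}
```

In a tight inner loop like ours, skipping the per-element check can matter, though the optimizer can often elide it on its own when the loop bounds make the indices provably valid.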

Initial benchmark results

Running cargo bench, we get:

test bench_column_store ... bench:    673444 ns/iter (+/- 112419) = 47488 MB/s
test bench_row_store    ... bench:  39832001 ns/iter (+/- 1517923) = 40000 MB/s

Raw memory throughput is only slightly higher for the column store (46 GB/s versus 39 GB/s), but the overall scan winds up being significantly faster, presumably because it touches far less memory. The row store scans 1,000,000 rows in 39.8 milliseconds, while the column store needs only 0.67 milliseconds, making it roughly 60 times faster. This works out very neatly, since we’re touching only 4 of the 200 columns, or 1/50th of the data.
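A quick sanity check on that arithmetic, plugging in the measured ns/iter figures from the benchmark output above:

```rust
fn main() {
    // Measured ns/iter from the two benchmark runs above.
    let row_store_ns = 39_832_001.0_f64;
    let column_store_ns = 673_444.0_f64;

    // The column store reads 4 of 200 columns, so touching 1/50th
    // of the data naively predicts about a 50x speedup.
    let expected = 200.0 / 4.0;
    let measured = row_store_ns / column_store_ns;

    println!("expected ~{:.0}x, measured {:.1}x", expected, measured);
}
```

The measured ratio comes out a bit above the naive 50x prediction, plausibly because the column store’s sequential, cache-friendly access pattern buys a little extra on top of simply reading less data.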

Standard caveats apply: Benchmarking is tricky, especially once you start working with large amounts of memory, and I’m probably making at least one dumb mistake above. But these preliminary results are good enough to convince me that a Rust-based column store is worth a further look. Of course, these benchmarks are measuring memory performance, and in the real world, I’m going to care more about SSD disk performance. But similar issues apply: the less data you need to touch, the faster it will go.

Also, I really enjoy Rust for this sort of thing. In particular, I can move seamlessly between high-level functional code and unsafe pointer bashing, and cargo makes it ridiculously easy to use third-party packages.

(Thank you to users of #rust who pointed out the undefined behavior above, and who helped me simplify the code. The blame for the low-level hackery is all mine, of course.)