Over the years, I’ve learned to be cautious with C++ pointers. In particular, I’m always very careful about who owns a given pointer, and who’s in charge of calling delete on it. But my caution often forces me to write deliberately inefficient functions. For example:

vector<string> tokenize_string(const string &text);

Here, we have a large string text, and we want to split it into a vector of tokens. This function is nice and safe, but it allocates one string for every token in the input. Now, if we were feeling reckless, we could avoid these allocations by returning a vector of pointers into text:

vector<pair<const char *,const char *>> tokenize_string2(const string &text);

In this version, each token is represented by two pointers into text: One pointing to the first character, and one pointing just beyond the last character.1 But this can go horribly wrong:

// Disaster strikes!
auto v = tokenize_string2(get_input_string());
munge(v);

Why does this fail? The function get_input_string returns a temporary string, and tokenize_string2 builds an array of pointers into that string. Unfortunately, the temporary string only lives until the end of the current expression, and then the underlying memory is released. And so all our pointers in v now point into oblivion—and our program just wound up getting featured in a CERT advisory. So personally, I’m going to prefer the inefficient tokenize_string function almost every time.

Rust lifetimes to the rescue!

Going back to our original design, let’s declare a type Token. Each token is either a Word or an Other, and each token contains pointers into a pre-existing string. In Rust, we can declare this as follows:

#[deriving(Show, PartialEq)]
enum Token<'a> {
    Word(&'a str),
    Other(&'a str)
}

The type &str represents a slice of a pre-existing String. It’s sort of like the pair<const char *,const char *> in C++. But the really interesting part here is the 'a. This is a named lifetime parameter, and it says, “A value of type Token has the same lifetime as the &str that it contains.”

Looking at the LLVM intermediate representation, Token looks like a nice, efficient data structure. It appears to be a tag byte for the enum, some padding, and two pointers for the &str:

%"enum.Token<[]>" = type { i8, [7 x i8], [2 x i64] }

Update: According to keeperofdakeys, those last two i64 values are actually a pointer and a length.

Parsing safely

Now we can define a safe tokenize_string3 function. Here, the function delaration says, “We take an input value of type &str with lifetime 'a, and we return a Vec<Token> where each token has lifetime 'a.”

fn tokenize_string3<'a>(text: &'a str) -> Vec<Token<'a>> {
    let mut result = vec![];
    for cap in regex!(r"(\w+)|(\W+)").captures_iter(text) {
        let token =
            if cap.pos(1).is_some() {
                Word(cap.at(1))
            } else {
                Other(cap.at(2))
            };
        result.push(token);
    }
    result
}

This works quite nicely:

#[test]
fn test_parse_safe() {
    assert_eq!(vec![Word("The"), Other(" "), Word("cat")],
               tokenize_string3("The cat"));
}

But what if we destroy text early?

But let’s rewrite this function to work like our C++ code, where our temporary string was destroyed before we tried to use our tokens:

#[test]
fn test_parse_unsafe() {
    let v = {
        let text = "The cat".to_string();
        tokenize_string3(text.as_slice())
    };
    assert_eq!(vec![Word("The"), Other(" "), Word("cat")], v);
}

Rust detects the error, and refuses to compile test_parse_unsafe:

main.rs:67:26: 67:30 error: `text` does not live long enough
main.rs:67         tokenize_string3(text.as_slice())
                                    ^~~~
main.rs:64:24: 70:2 note: reference must be valid for the block at 64:23...
(…code snippet deleted…)
main.rs:65:17: 68:6 note: ...but borrowed value is only valid for the block at 65:16
(…code snippet deleted…)

In other words, we can do all kinds of apparently reckless things with pointers, and Rust backs us up.

1Update: There are some good discussions of alternative C++ versions on Hacker News, /r/programming and /r/rust. But just to clarify:

  1. No responsible modern C++ programmer would ever mess around with pair<const char *,const char *> today. Yes, technically it’s just a pair of iterators. Modern code would instead use a boost::iterator_range, or something of the sort. However, all these types use pointers internally. I wanted to make the pointers explicit in my example, and I didn’t want to get into the implementation details of library types.
  2. The use of const string &text in tokenize_string2 allows the compiler to destroy the temporary string immediately. There are better ways of declaring this function which can provide limited protection against this error, at least within the scope of a single function. But do you really trust those solutions to survive five years of refactoring and maintenance programming?