Rust lifetimes: Getting away with things that would be reckless in C++
Over the years, I’ve learned to be cautious with C++ pointers. In
particular, I’m always very careful about who owns a given pointer, and
who’s in charge of calling delete
on it. But my caution often forces me
to write deliberately inefficient functions. For example:
vector<string> tokenize_string(const string &text);
Here, we have a large string text
, and we want to split it into a vector
of tokens. This function is nice and safe, but it allocates one string
for every token in the input. Now, if we were feeling reckless, we could
avoid these allocations by returning a vector of pointers into text
:
vector<pair<const char *,const char *>> tokenize_string2(const string &text);
In this version, each token is represented by two pointers into text
: One
pointing to the first character, and one pointing just beyond the last
character.1 But
this can go horribly wrong:
// Disaster strikes!
auto v = tokenize_string2(get_input_string());
munge(v);
Why does this fail? The function get_input_string
returns a temporary
string, and tokenize_string2
builds an array of pointers into that
string. Unfortunately, the temporary string only lives until the end of
the current expression, and then the underlying memory is released. And so
all our pointers in v
now point into oblivion—and our program just wound
up getting featured in a CERT advisory. So personally, I’m going to prefer
the inefficient tokenize_string
function almost every time.
Rust lifetimes to the rescue!
Going back to our original design, let’s declare a type Token
. Each
token is either a Word
or an Other
, and each token contains pointers
into a pre-existing string. In Rust, we can declare this as follows:
#[deriving(Show, PartialEq)]
enum Token<'a> {
Word(&'a str),
Other(&'a str)
}
The type &str
represents a slice of a pre-existing String
. It’s sort
of like the pair<const char *,const char *>
in C++. But the really
interesting part here is the 'a
. This is a named lifetime parameter,
and it says, “A value of type Token
has the same lifetime as the &str
that it contains.”
Looking at the LLVM intermediate representation, Token
looks like a nice,
efficient data structure. It appears to be a tag byte for the enum, some
padding, and two pointers for the &str
:
%"enum.Token<[]>" = type { i8, [7 x i8], [2 x i64] }
Update: According to keeperofdakeys, those last two i64
values
are actually a pointer and a length.
Parsing safely
Now we can define a safe tokenize_string3
function. Here, the function
delaration says, “We take an input value of type &str
with lifetime 'a
,
and we return a Vec<Token>
where each token has lifetime 'a
.”
fn tokenize_string3<'a>(text: &'a str) -> Vec<Token<'a>> {
let mut result = vec![];
for cap in regex!(r"(\w+)|(\W+)").captures_iter(text) {
let token =
if cap.pos(1).is_some() {
Word(cap.at(1))
} else {
Other(cap.at(2))
};
result.push(token);
}
result
}
This works quite nicely:
#[test]
fn test_parse_safe() {
assert_eq!(vec![Word("The"), Other(" "), Word("cat")],
tokenize_string3("The cat"));
}
But what if we destroy text
early?
But let’s rewrite this function to work like our C++ code, where our temporary string was destroyed before we tried to use our tokens:
#[test]
fn test_parse_unsafe() {
let v = {
let text = "The cat".to_string();
tokenize_string3(text.as_slice())
};
assert_eq!(vec![Word("The"), Other(" "), Word("cat")], v);
}
Rust detects the error, and refuses to compile test_parse_unsafe
:
main.rs:67:26: 67:30 error: `text` does not live long enough
main.rs:67 tokenize_string3(text.as_slice())
^~~~
main.rs:64:24: 70:2 note: reference must be valid for the block at 64:23...
(…code snippet deleted…)
main.rs:65:17: 68:6 note: ...but borrowed value is only valid for the block at 65:16
(…code snippet deleted…)
In other words, we can do all kinds of apparently reckless things with pointers, and Rust backs us up.
1Update: There are some good discussions of alternative C++ versions on Hacker News, /r/programming and /r/rust. But just to clarify:
- No responsible modern C++ programmer would ever mess around with
pair<const char *,const char *>
today. Yes, technically it’s just a pair of iterators. Modern code would instead use aboost::iterator_range
, or something of the sort. However, all these types use pointers internally. I wanted to make the pointers explicit in my example, and I didn’t want to get into the implementation details of library types. - The use of
const string &text
intokenize_string2
allows the compiler to destroy the temporary string immediately. There are better ways of declaring this function which can provide limited protection against this error, at least within the scope of a single function. But do you really trust those solutions to survive five years of refactoring and maintenance programming?
Want to contact me about this article? Or if you're looking for something else to read, here's a list of popular posts.