Haskell: Queues without pointers
Haskell has been tying my brain in knots. Sure, it keeps teaching me all sorts of amazing things, but it’s also forcing me to relearn the basics.
Right now, I’m trying to implement simple data structures in Haskell. It’s challenging, because most typical data structures are based on updating pointers, and Haskell doesn’t allow that. (Well, I could cheat, and use the IO monad, but where’s the fun in that?)
Fortunately, Chris Okasaki has done some amazing research into Haskell (and ML) data structures. He wrote an excellent thesis, which was later published as Purely Functional Data Structures. It’s a comprehensive book, well-written, and highly recommended for anyone programming in a functional language.
Let’s take a look at a data structure from Okasaki’s book: A purely functional queue. It’s competitive with a traditional queue (on average, at least), but it doesn’t use any pointers.
The queue API
A queue is a list of values to process. We can add values to the back of the queue, and retrieve values from the front.
In Haskell, we often begin by writing down the types of our functions. If we’re lucky, the code will fill itself in almost automatically. The first part of our API is easy:
-- A queue holding values of type 'a'.
data Queue a = ???
-- Create a new queue.
newQueue :: Queue a
-- Test whether a queue is empty.
empty :: Queue a -> Bool
The next part is a bit trickier. Since we can’t update our actual Queue values, our data structure is read-only! We can work around this by making enq and deq return an updated version of the Queue:
-- Add an item to the back of the queue,
-- returning the updated queue.
enq :: Queue a -> a -> Queue a
-- Remove an item from the front of the
-- queue, returning the item and the
-- updated queue.
deq :: Queue a -> (a, Queue a)
We could use this API as follows:
q = newQueue `enq` 1 `enq` 2 `enq` 3
(x,q') = deq q -- x is 1
(y,q'') = deq q' -- y is 2
Now, the tricky bit: Making it run quickly.
Haskell queues, in O(1) amortized time
We could represent our queues as lists, but they’d be slow. Appending a value to the end of a Haskell list requires walking over all the existing values.
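To see the problem concretely, here’s a rough sketch of the single-list queue we’re avoiding. It reuses the API names from above, but it is not the implementation we’ll build:
data Queue a = Queue [a]

newQueue :: Queue a
newQueue = Queue []

empty :: Queue a -> Bool
empty (Queue xs) = null xs

-- (++) rebuilds all of xs, so each enq costs time
-- proportional to the length of the queue.
enq :: Queue a -> a -> Queue a
enq (Queue xs) y = Queue (xs ++ [y])

-- deq, at least, runs in constant time.
deq :: Queue a -> (a, Queue a)
deq (Queue [])     = error "Can't deq from an empty queue"
deq (Queue (x:xs)) = (x, Queue xs)
If the queue typically holds a thousand items, every single enq copies a thousand-element list.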
The trick is simple: We represent a queue as two lists. The first list contains the front of the queue, and the second list contains the back. We store the back part in reverse order, so we can add a new value in constant time:
data Queue a = Queue [a] [a]
The newQueue and empty functions are trivial:
newQueue = Queue [] []
empty (Queue [] []) = True
empty _ = False
Adding a new item to the queue is easy. Because the second list is stored in reverse, we just cons the new item onto its front:
enq (Queue xs ys) y = Queue xs (y:ys)
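For instance, the three-item queue from the earlier example ends up stored like this (the internal values in the comments are just my illustration):
q = newQueue `enq` 1 `enq` 2 `enq` 3
-- q is Queue [] [3,2,1] internally. Each enq is a
-- single (:), so it takes constant time no matter
-- how long the queue already is.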
The first two cases of deq are similarly straightforward:
-- If the queue is empty, raise an error.
deq (Queue [] []) =
error "Can't deq from an empty queue"
-- If there's at least one item in the front
-- part of the queue, return it.
deq (Queue (x:xs) ys) = (x, Queue xs ys)
But what if the front part of the queue is empty? We can just reverse the back part and move it to the front:
-- If the front part is empty, reverse the
-- back part, move it to the front, and try
-- again.
deq (Queue [] ys) =
deq (Queue (reverse ys) [])
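Tracing a deq on that same queue shows the reversal in action (again, the intermediate values are my own annotations):
(x, q') = deq (Queue [] [3,2,1])
-- reverse [3,2,1] is [1,2,3], so this becomes
-- deq (Queue [1,2,3] []), giving x = 1 and
-- q' = Queue [2,3] []. Later deqs on q' are
-- plain pattern matches on the front list.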
Our implementation of enq runs in constant time. But deq occasionally needs to reverse a list, which requires walking the entire length of the list. Fortunately, each item only gets moved from the back list to the front list once, so deq averages out to constant time: n enqueues followed by n dequeues take roughly 3n steps in total, or a constant number of steps per operation.
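For reference, here’s the whole queue gathered into one runnable file. The deriving Show and the little main at the end are my own additions for poking at it in GHCi; they’re not part of Okasaki’s presentation:
-- A queue: the front of the queue, then the back
-- of the queue stored in reverse.
data Queue a = Queue [a] [a]
  deriving Show

newQueue :: Queue a
newQueue = Queue [] []

empty :: Queue a -> Bool
empty (Queue [] []) = True
empty _             = False

enq :: Queue a -> a -> Queue a
enq (Queue xs ys) y = Queue xs (y:ys)

deq :: Queue a -> (a, Queue a)
deq (Queue [] [])     = error "Can't deq from an empty queue"
deq (Queue (x:xs) ys) = (x, Queue xs ys)
deq (Queue [] ys)     = deq (Queue (reverse ys) [])

main :: IO ()
main = do
  let q        = newQueue `enq` 1 `enq` 2 `enq` (3 :: Int)
      (x, q')  = deq q   -- the one and only reverse happens here
      (y, q'') = deq q'  -- from now on it's all pattern matching
  print q          -- Queue [] [3,2,1]
  print (x, q')    -- (1,Queue [2,3] [])
  print (y, q'')   -- (2,Queue [3] [])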
And note that our queues are “persistent”: we can hold onto earlier versions of the queue, in case we need to implement “Undo” or some sort of backtracking. (Update: As mauke points out, using this queue persistently can break the amortized O(1) guarantees. See Okasaki for a discussion and a simple fix.)
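Here’s a tiny sketch of what that persistence looks like in practice (the results in the comments follow from the definitions above):
q1 = newQueue `enq` 1 `enq` 2
q2 = q1 `enq` 3    -- q2 is a brand-new queue; q1 is untouched
(x1, _) = deq q1   -- x1 is 1
(x2, _) = deq q2   -- x2 is 1, and q2 still has 2 and 3 behind it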
There’s a lot more good stuff in Okasaki’s book: True constant-time queues, an explanation of how laziness and strictness affect running time, and a wide variety of standard data structures. Highly recommended.