Hunting for Bugs in Rust

Way back in August I announced that I was starting in on "a project to QuickCheck Rust’s standard library data structures", here. And I did! The project is called bughunt-rust and I've been poking at it on weekends since, adjusting my approach based on papers I've been reading, experience gained writing test code and the kind of results I've been getting. This post goes through what I've been up to, where I see the project heading in the near term.

Anyhow, so, here's the thing. If you go look at bughunt's Cargo.toml you won't find a QuickCheck implementation. Have a look. What you will find is a fuzzer – honggfuzz – and the arbitrary crate. Why no QuickCheck?

The goal of randomized testing is to explore the internal state space of a program as rapidly, but thoroughly, as possible, go down all the branches, overflow any integers that are going to overflow etc. This is a CPU intensive project. The goal of randomized testing is, then, to fiddle with the variety of the random inputs to avoid duplication of work – exploring the same path through the program – or burning up time vigorously exploring only a small portion of the program's possible state space. Imagine a program which has a whole flow gated behind a check to make sure that some integer input to the check is both prime and has greater than 32 ones in its binary representation. If you're just generating random integers it's not impossible that your random test will find its way into that state space but it's not probable either. Good luck. Anyway, Claessen and Hughes' insight with QuickCheck was to allow the programmer to tailor the state space search directly by generating random instances of types; instead of generating a random integer in the testing of our previous prime-gated clause you, the programmer, might choose to generate only random primes. Much better odds. Really effective QuickCheck testing is done in the model fashion. That is, the system under test (SUT) is fed randomly generated 'operations' which are simultaneously fed to an 'obviously correct' model of the SUT. If the model and the SUT agree, no bug found. If they disagree, time to diagnose. QuickCheck then enters a 'shrink step' where the failing input is shrunk by programmer defined heuristics to, hopefully, fish out an easier to debug example. Speaking from personal experience, it's really frustrating to debug a program if the failing case you've got on hand is 30k operations. Model checking with QuickCheck is excellent.

Anyway, the trouble with most QuickCheck implementations is that they are blind. They don't know how well the program under tests' state space is being explored, can't tailor themselves at runtime and can't report to the programmer the results of the run. Did your model and SUT agree or did QuickCheck spend all its time failing to get past an if clause somewhere? Worse, some QuickCheck implementations only run for brief periods, a matter of minutes, when quality random runs occur on the order of days, as discussed in the delightful Evaluating Fuzz Testing by Klees et al.

Hey, fuzz testers aren't blind though. Yep! Common fuzz testers will have instrumented the SUT and get online coverage feedback from the same, allowing the fuzzer to tailor inputs based on that coverage to explore other parts of the program or really dig into the current 'hot' area. Coverage is just one signal fuzzers use – there are tons more, go read Evaluating Fuzz Testing – but it's common. What fuzzers don't do very easily right now is the kind of model checking I described above, instead being almost solely concerned with inputting random nonsense into a function and then watching if said nonsense causes said function to panic. This makes discovering stateful crashes in a program unlikely finds, crashes like "the state machine only recognizes the character 'a' 3,427 times and thereafter silently discards it". QuickCheck finds this kind of thing, less so fuzzers. Some fuzzers do offer minimization but they do so not in the QuickCheck way of finding smaller examples of the type but by shrinking the input nonsense. This might end up being the same thing – hypothesis generates its types by coercing a byte buffer into types – but the interested reader can probably think up scenarios where a fuzzer's notion of shrinking and QuickCheck's notion of shrinking will diverge quite a bit in result.

What's up with bughunt using a fuzzer but pulling in the arbitrary crate? Well, I'm trying to make QuickCheck less blind by building on top of a fuzzer. There's prior art for this. QuickFuzz by Grieco et al does pretty well exactly what I'd like to do but in Haskell-land. My ambition with bughunt is, at present, this:

  1. Choose some data structure from Rust std.
  2. Build some 'obviously correct' model of that data structure.
  3. Build out a list of interesting 'operations' on that model.
  4. Build a fuzz target for the data structure which generates operations from the input byte blob, run the operations against the data structure and the model, compare.
  5. Loop on extending the operations list, model as necessary.

Let's look at an example. The first data structure I targeted for testing was std::collections::HashMap. The model code lives in src/stdlib/collections/hash_map.rs . The 'obviously correct model' starts on line 78, PropHashMap<K, V>. The underlying data store for this map is a vec of tuples. This vector is maintained in insertion order by key; this has awful algorithmic characteristics as a hash map but is very straightforward to understand and implement. The 'operations' start on line 161, an enum called Op<K, V> with variants like ShrinkToFit and Insert {k: K, v: V}. The Arbitrary definition is on line 196. Okay, so that's the obviously correct model and the operations. Where's the model checking bit? That happens in src/bin/stdlib/collections/hash_map.rs. You'll note this is a fuzz target, for honggfuzz specifically but for any fuzzer that follows a similar approach with just a few lines' code change. Line 13 establishes the pool of entropy; the FiniteBuffer holds the byte blob the fuzzer passes through and allows for the creation of Arbitrary instances from that blob, coercing bytes as appropriate, so long as they last. The next handful of lines do setup work, getting the initial capacity of the HashMap under test and so forth. Line 42 is where the real action is. It's at this line that the interpretive loop of the target begins. An operation, should enough bytes in the FiniteBuffer remain, is created and then applied to model and SUT. Here's the interpretation of Insert{k: K, v: V}:

Op::Insert { k, v } => {
    let model_res = model.insert(k, v);
    let sut_res = sut.insert(k, v);
    assert_eq!(model_res, sut_res);
}

Insert into the model, insert into SUT and assert equality of the result. Invariant checks start on line 98.

If you squint, this is a QuickCheck implementation backed by a fuzzer's capability to inspect the program being tested. There is, unfortunately, no shrinking (unless, like AFL, the fuzzer itself supports shrinking) and I'm not totally sure yet how to correct that, though I do have some ideas. I'm not a huge fan of how the model check code is separate from the model code itself, I'd rather place that in the same file under a test module. The way test results are reported is also totally dependent on how the fuzzer does things, itself a pain for debugging when combined with a lack of shrinking. So, the tooling is early days yet but the idea strikes me as sound and worth pursuing, especially with the prior art of QuickFuzz.

Have I found any bugs? Well, maybe. Not logic bugs, so that's a good deal for HashMap and VecDeque! What I have found are tons of ways to panic a program when doing allocation. If you go back to look in the HashMap target you'll see that the hash's initial capacity is chosen to be a value well below usize::max_value(). The tricky bit about fuzzing / QuickChecking the Rust standard library is that allocations which fail panic, as of this writing. It is true that I can catch and unwind panics but, to my knowledge, I can't distinguish an OOM panic from any other kind, say an assert_eq that doesn't assert equal. What to do? Well, for now, when possible I compute when a panic would happen but that's not totally foolproof. The less memory you have in your system, the more likely you are to OOM panic no matter how much memory you can theoretically allocate. I also make sure my allocations are as small as possible. That's why the structures under test are done over u8 or similar. If I don't take these precautions it's distressingly easy to OOM. This has been the trickiest thing about QuickCheck/fuzzing the Rust standard library, avoiding these allocation panics. Lesson learned: infallible dynamic memory allocation is a rough business. Happily there's work going on in the community to address this – see RFC #2116 – and it's all going along at a good clip.

What comes next? Well, the existing three targets I've got all test components of Rust that have had known defects in the past. It's time to dig these up and start measuring time-to-discover times for different fuzzers, different techniques. It's also time to start introducing custom allocators into these targets, see if I can't stumble into a tidy way to address the OOM issue. I've been fiddling with a tooling solution to unify the current approach behind some easy to comprehend thing. As is, there's too many little pieces flying around and it's easy to neglect one or misunderstand it, which is worse. I don't have any code public for that fiddling because it's something I work on as inspiration strikes. Maybe soon.