Rustdoc Search Index: VLQ Hex, Roaring Bitmaps, and Description Storage

before serializing. Doing this allows rustdoc not only to make the search index smaller, but also to reuse the same string representing the parent path across multiple in-memory items.

### Representing sparse columns

#### VLQ Hex

This format is, as far as I know, used nowhere other than rustdoc. It follows this grammar:

```ebnf
VLQHex = { VHItem | VHBackref }
VHItem = VHNumber | ( '{', {VHItem}, '}' )
VHNumber = { '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' }, ( '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' )
VHBackref = ( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | ';' | '<' | '=' | '>' | '?' )
```

A VHNumber is a variable-length, self-terminating base16 number (terminated because the last hexit is lowercase while all others are uppercase). The sign bit is represented using [zig-zag encoding].

This alphabet is chosen because the characters can be turned into hexits by masking the ASCII encoding down to its low four bits.

A major feature of this encoding, as with all of the "compression" done in rustdoc, is that it can remain in its compressed format *even in memory at runtime*. This is why `VHBackref` is only used at the top level, and why we don't just use [Flate] for everything: the decoder in search.js will reuse the entire decoded object whenever a backref is seen, saving decode work and memory.

#### Roaring Bitmaps

Flag-style data, such as deprecation and empty descriptions, are stored using the [standard Roaring Bitmap serialization format with runs]. The data is then base64 encoded when writing it.

As a brief overview: a roaring bitmap is a chunked array of bits, described in [this paper]. A chunk can either be a list of integers, a bitfield, or a list of runs. In any case, the search engine has to base64 decode it, and read the chunk index itself, but the payload data stays as-is.

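The three chunk kinds can be illustrated with a small in-memory model. This is only a sketch, not rustdoc's decoder and not the serialized byte layout: the names `Chunk`, `Bitmap`, and `contains` are hypothetical, and it assumes the usual Roaring conventions that the high 16 bits of a value select the chunk, the low 16 bits are looked up inside it, and a run is stored as a start plus its length minus one.

```rust
/// One chunk holds the low 16 bits of every set value that shares the same
/// high 16 bits. Hypothetical model for illustration only.
enum Chunk {
    /// Sorted list of integers (good for sparse chunks).
    List(Vec<u16>),
    /// Bitfield with one bit per possible low-16-bit value (65,536 bits).
    Bits(Box<[u64; 1024]>),
    /// Sorted runs, each stored as (start, length - 1).
    Runs(Vec<(u16, u16)>),
}

impl Chunk {
    fn contains(&self, low: u16) -> bool {
        match self {
            Chunk::List(values) => values.binary_search(&low).is_ok(),
            Chunk::Bits(words) => {
                // Word index is low / 64, bit index is low % 64.
                ((words[usize::from(low >> 6)] >> (low & 63)) & 1) == 1
            }
            Chunk::Runs(runs) => runs.iter().any(|&(start, len_minus_one)| {
                // Widen to u32 so start + length cannot overflow.
                low >= start && u32::from(low) <= u32::from(start) + u32::from(len_minus_one)
            }),
        }
    }
}

/// Chunks keyed by the high 16 bits of the values they cover, sorted by key.
struct Bitmap {
    chunks: Vec<(u16, Chunk)>,
}

impl Bitmap {
    /// Is the flag set for this item index?
    fn contains(&self, item: u32) -> bool {
        let key = (item >> 16) as u16;
        let low = (item & 0xFFFF) as u16;
        match self.chunks.binary_search_by_key(&key, |(k, _)| *k) {
            Ok(i) => self.chunks[i].1.contains(low),
            Err(_) => false,
        }
    }
}
```

Whichever kind a chunk uses, a lookup never has to expand it into another form, which is the property the text relies on when it says the payload data stays as-is.
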
All roaring bitmaps in rustdoc currently store a flag for each item index. The crate is item 0, all others start at 1.

### How descriptions are stored

The largest amount of data, and the main thing Rustdoc Search deals with that isn't actually used for searching, is descriptions. In a SERP table, this is what appears in the rightmost column.

> | item type | item path             | ***description*** (this part)                       |
> | --------- | --------------------- | --------------------------------------------------- |
> | function  | my_crate::my_function | This function gets the name of an integer with Data |

When someone runs a search in rustdoc for the first time, their browser will work through a "sandwich workload" of three steps:

1. Download the search-index.js and search.js files (a network bottleneck).
2. Perform the actual search (a CPU and memory bandwidth bottleneck).
3. Download the description data (another network bottleneck).

Reducing the amount of data downloaded here will almost always increase latency, by delaying the decision of what to download behind other work and/or adding data dependencies where something can't be downloaded without first downloading something else. In this case, we can't start downloading descriptions until after the search is done, because that's what allows it to decide *which* descriptions to download (it needs to sort the results then truncate to 200).

To do this, two columns are stored in the search index, building on both Roaring Bitmaps and on VLQ Hex.

* `e` is an index of **e**mpty descriptions. It's a [roaring bitmap] of each item (the crate itself is item 0, the rest start at 1).
* `D` is a shard list, stored in [VLQ hex] as a flat list of integers. Each integer gives you the number of descriptions in the shard.

As the decoder walks the index, it checks if the description is empty. If it's not, then it's in the "current" shard. When all items in the current shard are exhausted, it goes on to the next shard.

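The walk described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code in search.js: the function names are hypothetical, a plain `&[bool]` stands in for the `e` roaring bitmap, the `D` string is assumed to contain only `VHNumber`s (no braces or backrefs), and the sign is decoded with the textbook zig-zag mapping, which may differ from rustdoc's exact convention for negative values.

```rust
/// Decode one VHNumber starting at `pos`, returning the signed value and the
/// position just past it. Hypothetical helper, not rustdoc's actual decoder.
fn decode_vh_number(input: &[u8], mut pos: usize) -> Option<(i64, usize)> {
    let mut raw: u64 = 0;
    loop {
        let c = *input.get(pos)?;
        pos += 1;
        // The low four bits of the ASCII code are the hexit.
        let hexit = u64::from(c & 0x0F);
        match c {
            // '@'..='O': a leading hexit, more characters follow.
            b'@'..=b'O' => raw = (raw << 4) | hexit,
            // '`'..='o': the final hexit, which terminates the number.
            b'`'..=b'o' => {
                raw = (raw << 4) | hexit;
                // Textbook zig-zag decode (assumption): low bit is the sign.
                let value = ((raw >> 1) as i64) ^ -((raw & 1) as i64);
                return Some((value, pos));
            }
            // Backrefs, braces, and anything else are out of scope here.
            _ => return None,
        }
    }
}

/// Decode a flat `D`-style list of non-negative shard sizes.
fn decode_shard_sizes(input: &str) -> Option<Vec<usize>> {
    let bytes = input.as_bytes();
    let mut sizes = Vec::new();
    let mut pos = 0;
    while pos < bytes.len() {
        let (value, next) = decode_vh_number(bytes, pos)?;
        if value < 0 {
            return None;
        }
        sizes.push(value as usize);
        pos = next;
    }
    Some(sizes)
}

/// Find (shard index, line within shard) for an item, or `None` if the item
/// has an empty description or does not exist.
fn locate_description(
    item: usize,
    is_empty: &[bool],
    shard_sizes: &[usize],
) -> Option<(usize, usize)> {
    if *is_empty.get(item)? {
        return None;
    }
    // Count the non-empty descriptions that come before this item.
    let mut remaining = is_empty[..item].iter().filter(|&&e| !e).count();
    for (shard, &size) in shard_sizes.iter().enumerate() {
        if remaining < size {
            return Some((shard, remaining));
        }
        remaining -= size;
    }
    None
}
```

With shard sizes `[2, 2]` and items 1 and 4 marked empty, items 0 and 2 fall in shard 0 and items 3 and 5 fall in shard 1, so only the shards that hold the top results need to be downloaded.
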
Inside each shard is a newline-delimited list of descriptions,

This section details how the Rustdoc Search index handles sparse columns and descriptions. It explains VLQ Hex encoding, used for compact data representation, and its ability to remain compressed in memory. It covers Roaring Bitmaps, a standard format for storing flag-style data like deprecation. Finally, it describes the 'sandwich workload' involved in searching and how descriptions are stored using Roaring Bitmaps for empty descriptions and VLQ Hex for shards containing non-empty descriptions.