Home Explore Blog Models CI



rustc

3rd chunk of `src/rustdoc-internals/search.md`
2cb61a050155fe991cc04b818d82e970630e094e2cf17d720000000100000fa6
before serializing.
Doing this allows rustdoc to not only make the search index smaller,
but reuse the same string representing the parent path across multiple in-memory items.

### Representing sparse columns

#### VLQ Hex

This format is, as far as I know, used nowhere other than rustdoc.
It follows this grammar:

```ebnf
VLQHex = { VHItem | VHBackref }
VHItem = VHNumber | ( '{', {VHItem}, '}' )
VHNumber = { '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' }, ( '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k ' | 'l' | 'm' | 'n' | 'o' )
VHBackref = ( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | ';' | '<' | '=' | '>' | '?' )
```

A VHNumber is a variable-length, self-terminating base16 number
(terminated because the last hexit is lowercase while all others are uppercase).
The sign bit is represented using [zig-zag encoding].

This alphabet is chosen because the characters can be turned into hexits by
masking off the last four bits of the ASCII encoding.

A major feature of this encoding, as with all of the "compression" done in rustdoc,
is that it can remain in its compressed format *even in memory at runtime*.
This is why `HBackref` is only used at the top level,
and why we don't just use [Flate] for everything: the decoder in search.js
will reuse the entire decoded object whenever a backref is seen,
saving decode work and memory.


#### Roaring Bitmaps

Flag-style data, such as deprecation and empty descriptions,
are stored using the [standard Roaring Bitmap serialization format with runs].
The data is then base64 encoded when writing it.

As a brief overview: a roaring bitmap is a chunked array of bits,
described in [this paper].
A chunk can either be a list of integers, a bitfield, or a list of runs.
In any case, the search engine has to base64 decode it,
and read the chunk index itself,
but the payload data stays as-is.

All roaring bitmaps in rustdoc currently store a flag for each item index.
The crate is item 0, all others start at 1.


### How descriptions are stored

The largest amount of data,
and the main thing Rustdoc Search deals with that isn't
actually used for searching, is descriptions.
In a SERP table, this is what appears on the rightmost column.

> | item type | item path             | ***description*** (this part)                       |
> | --------- | --------------------- | --------------------------------------------------- |
> | function  | my_crate::my_function | This function gets the name of an integer with Data |

When someone runs a search in rustdoc for the first time, their browser will
work through a "sandwich workload" of three steps:

1. Download the search-index.js and search.js files (a network bottleneck).
2. Perform the actual search (a CPU and memory bandwidth bottleneck).
3. Download the description data (another network bottleneck).

Reducing the amount of data downloaded here will almost always increase latency,
by delaying the decision of what to download behind other work and/or adding
data dependencies where something can't be downloaded without first downloading
something else. In this case, we can't start downloading descriptions until
after the search is done, because that's what allows it to decide *which*
descriptions to download (it needs to sort the results then truncate to 200).

To do this, two columns are stored in the search index, building on both
Roaring Bitmaps and on VLQ Hex.

* `e` is an index of **e**mpty descriptions. It's a [roaring bitmap] of
  each item (the crate itself is item 0, the rest start at 1).
* `D` is a shard list, stored in [VLQ hex] as flat list of integers.
  Each integer gives you the number of descriptions in the shard.
  As the decoder walks the index, it checks if the description is empty.
  if it's not, then it's in the "current" shard. When all items are
  exhausted, it goes on to the next shard.

Inside each shard is a newline-delimited list of descriptions,

Title: Rustdoc Search Index: VLQ Hex, Roaring Bitmaps, and Description Storage
Summary
This section details how the Rustdoc Search index handles sparse columns and descriptions. It explains VLQ Hex encoding, used for compact data representation, and its ability to remain compressed in memory. It covers Roaring Bitmaps, a standard format for storing flag-style data like deprecation. Finally, it describes the 'sandwich workload' involved in searching and how descriptions are stored using Roaring Bitmaps for empty descriptions and VLQ Hex for shards containing non-empty descriptions.