Home Explore Blog CI



rustc

4th chunk of `src/rustdoc-internals/search.md`
ee227185ed5045be6819b503fae149f399d4263bd9e38fb50000000100000fa0
by delaying the decision of what to download behind other work and/or adding
data dependencies where something can't be downloaded without first downloading
something else. In this case, we can't start downloading descriptions until
after the search is done, because that's what allows it to decide *which*
descriptions to download (it needs to sort the results then truncate to 200).

To do this, two columns are stored in the search index, building on both
Roaring Bitmaps and on VLQ Hex.

* `e` is an index of **e**mpty descriptions. It's a [roaring bitmap] of
  each item (the crate itself is item 0, the rest start at 1).
* `D` is a shard list, stored in [VLQ hex] as flat list of integers.
  Each integer gives you the number of descriptions in the shard.
  As the decoder walks the index, it checks if the description is empty.
  if it's not, then it's in the "current" shard. When all items are
  exhausted, it goes on to the next shard.

Inside each shard is a newline-delimited list of descriptions,
wrapped in a JSONP-style function call.


### `i`, `f`, and `p`

`i` and `f` both index into `p`, the array of parent items.

`i` is just a one-indexed number
(not zero-indexed because `0` is used for items that have no parent item).
It's different from `q` because `q` represents the parent *module or crate*,
which everything has,
while `i`/`q` are used for *type and trait-associated items* like methods.

`f`, the function signatures, use a [VLQ hex] tree.
A number is either a one-indexed reference into `p`,
a negative number representing a generic,
or zero for null.

(the internal object representation also uses negative numbers,
even after decoding,
to represent generics).

For example, `{{gb}{d}}` is equivalent to the json `[[3, 1], [2]]`.
Because of zigzag encoding, `` ` `` is +0, `a` is -0 (which is not used),
`b` is +1, and `c` is -1.

## Searching by name

Searching by name works by looping through the search index
and running these functions on each:

* [`editDistance`] is always used to determine a match
  (unless quotes are specified, which would use simple equality instead).
  It computes the number of swaps, inserts, and removes needed to turn
  the query name into the entry name.
  For example, `foo` has zero distance from itself,
  but a distance of 1 from `ofo` (one swap) and `foob` (one insert).
  It is checked against an heuristic threshold, and then,
  if it is within that threshold, the distance is stored for ranking.
* [`String.prototype.indexOf`] is always used to determine a match.
  If it returns anything other than -1, the result is added,
  even if `editDistance` exceeds its threshold,
  and the index is stored for ranking.
* [`checkPath`] is used if, and only if, a parent path is specified
  in the query. For example, `vec` has no parent path, but `vec::vec` does.
  Within checkPath, editDistance and indexOf are used,
  and the path query has its own heuristic threshold, too.
  If it's not within the threshold, the entry is rejected,
  even if the first two pass.
  If it's within the threshold, the path distance is stored
  for ranking.
* [`checkType`] is used only if there's a type filter,
  like the struct in `struct:vec`. If it fails,
  the entry is rejected.

If all four criteria pass
(plus the crate filter, which isn't technically part of the query),
the results are sorted by [`sortResults`].


## Searching by type

Searching by type can be divided into two phases,
and the second phase has two sub-phases.

* Turn names in the query into numbers.
* Loop over each entry in the search index:
   * Quick rejection using a bloom filter.
   * Slow rejection using a recursive type unification algorithm.

In the names->numbers phase, if the query has only one name in it,
the editDistance function is used to find a near match if the exact match fails,
but if there's multiple items in the query,
non-matching items are treated as generics instead.
This means `hahsmap` will match hashmap on its own, but `hahsmap, u32`

Title: Rustdoc Search Index: Description Storage, Parent Items, Function Signatures, and Search Algorithms
Summary
This section explains how descriptions are stored using Roaring Bitmaps and VLQ Hex. It then details the use of `i`, `f`, and `p` indices to represent parent items and function signatures using a VLQ hex tree. It then discusses the search algorithms used by Rustdoc, including editDistance, indexOf, checkPath, and checkType, along with the sorting process. Finally, it breaks down the type search process into phases, covering name conversion, bloom filter quick rejection, and a recursive type unification algorithm for slower rejection.