So, as you continue your journey through the rest of the guide, keep these
things in mind. They will often inform decisions that we make.
### Intermediate representations
As with most compilers, `rustc` uses some intermediate representations (IRs) to
facilitate computations. In general, working directly with the source code is
extremely inconvenient and error-prone. Source code is designed to be human-friendly while at
the same time being unambiguous, but it's less convenient for doing something
like, say, type checking.
Instead most compilers, including `rustc`, build some sort of IR out of the
source code which is easier to analyze. `rustc` has a few IRs, each optimized
for different purposes:
- Token stream: the lexer produces a stream of tokens directly from the source
code. This stream of tokens is easier for the parser to deal with than raw
text.
- Abstract Syntax Tree (`AST`): the abstract syntax tree is built from the stream
of tokens produced by the lexer. It represents
pretty much exactly what the user wrote. It helps to do some syntactic sanity
checking (e.g. checking that a type is expected where the user wrote one).
- High-level IR (HIR): This is a sort of desugared `AST`. It's still close
to what the user wrote syntactically, but it includes some implicit things
such as some elided lifetimes, etc. This IR is amenable to type checking.
- Typed `HIR` (THIR) _formerly High-level Abstract IR (HAIR)_: This is an
intermediate between `HIR` and MIR. It is like the `HIR` but it is fully typed
and a bit more desugared (e.g. method calls and implicit dereferences are
made fully explicit). As a result, it is easier to lower to `MIR` from `THIR` than
from HIR.
- Middle-level IR (`MIR`): This IR is basically a Control-Flow Graph (CFG). A CFG
is a type of diagram that shows the basic blocks of a program and how control
flow can go between them. Likewise, `MIR` also has a bunch of basic blocks with
simple typed statements inside them (e.g. assignment, simple computations,
etc) and control flow edges to other basic blocks (e.g., calls, dropping
values). `MIR` is used for borrow checking and other
important dataflow-based checks, such as checking for uninitialized values.
It is also used for a series of optimizations and for constant evaluation (via
`MIRI`). Because `MIR` is still generic, we can do a lot of analyses here more
efficiently than after monomorphization.
- `LLVM-IR`: This is the standard form of all input to the LLVM compiler. `LLVM-IR`
is a sort of typed assembly language with lots of annotations. It's
a standard format that is used by all compilers that use LLVM (e.g. the clang
C compiler also outputs `LLVM-IR`). `LLVM-IR` is designed to be easy for other
compilers to emit and also rich enough for LLVM to run a bunch of
optimizations on it.
One other thing to note is that many values in the compiler are _interned_.
This is a performance and memory optimization in which we allocate the values in
a special allocator called an
_[arena]_. Then, we pass
around references to the values allocated in the arena. This allows us to make
sure that identical values (e.g. types in your program) are only allocated once
and can be compared cheaply by comparing pointers. Many of the intermediate
representations are interned.
### Queries
The first big implementation choice is Rust's use of the _query_ system in its
compiler. The Rust compiler _is not_ organized as a series of passes over the
code which execute sequentially. The Rust compiler does this to make
incremental compilation possible -- that is, if the user makes a change to
their program and recompiles, we want to do as little redundant work as
possible to output the new binary.
In `rustc`, all the major steps above are organized as a bunch of queries that
call each other. For example, there is a query to ask for the type of something
and another to ask for the optimized `MIR` of a function. These queries can call