code for. This is called _monomorphization collection_ and it happens at the
`MIR` level.
### Code generation
We then begin what is simply called _code generation_ or _codegen_. The [code
generation stage][codegen] is when higher-level representations of source are
turned into an executable binary. Since `rustc` uses LLVM for code generation,
the first step is to convert the `MIR` to `LLVM-IR`. This is where the `MIR` is
actually monomorphized. The `LLVM-IR` is passed to LLVM, which performs many
more optimizations on it and emits machine code, which is essentially assembly
code with additional low-level types and annotations added (e.g. an ELF object
or `WASM`). The different libraries/binaries are then linked together to
produce the final binary.
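To make monomorphization more concrete, here is a minimal sketch. The function names (`largest`, `largest_i32`, `largest_f64`) are hypothetical and chosen only for illustration. The hand-written copies approximate the idea of what gets generated; they are not the compiler's actual output, and real generated symbols are named differently.

```rust
// One generic definition in the source program.
fn largest<T: PartialOrd + Copy>(a: T, b: T) -> T {
    if a > b { a } else { b }
}

// Conceptually, monomorphization yields one concrete copy for each type the
// generic function is used with. These hand-written versions are illustrative
// stand-ins for those copies (an assumption of this sketch, not rustc output).
fn largest_i32(a: i32, b: i32) -> i32 {
    if a > b { a } else { b }
}

fn largest_f64(a: f64, b: f64) -> f64 {
    if a > b { a } else { b }
}

fn main() {
    // Two uses with different types, so two instantiations are needed:
    // T = i32 and T = f64.
    assert_eq!(largest(3, 7), largest_i32(3, 7));
    assert_eq!(largest(2.5, 1.5), largest_f64(2.5, 1.5));
    println!("{} {}", largest(3, 7), largest(2.5, 1.5));
}
```

Each additional type that `largest` is used with would need another concrete copy, which is one reason heavy use of generics can increase both compile time and binary size.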
## How it does it
Now that we have a high-level view of what the compiler does to your code,
let's take a high-level view of _how_ it does all that stuff. There are a lot
of constraints and conflicting goals that the compiler needs to
satisfy/optimize for. For example,
- Compilation speed: how fast is it to compile a program? More/better
  compile-time analyses often mean compilation is slower.
  - Also, we want to support incremental compilation, so we need to take that
    into account. How can we keep track of what work needs to be redone and
    what can be reused if the user modifies their program?
    - Also, we can't store too much stuff in the incremental cache because
      it would take a long time to load from disk and it could take a lot
      of space on the user's system...
- Compiler memory usage: while compiling a program, we don't want to use more
memory than we need.
- Program speed: how fast is your compiled program? More/better compile-time
analyses often mean the compiler can do better optimizations.
- Program size: how large is the compiled binary? Similar to the previous
point.
- Compiler compilation speed: how long does it take to compile the compiler?
This impacts contributors and compiler maintenance.
- Implementation complexity: building a compiler is one of the hardest
things a person/group can do, and Rust is not a very simple language, so how
do we make the compiler's code base manageable?
- Compiler correctness: the binaries produced by the compiler should do what
the input programs say they do, and should continue to do so despite the
tremendous amount of change constantly going on.
- Integration: a number of other tools need to use the compiler in
various ways (e.g. `cargo`, `clippy`, `Miri`) that must be supported.
- Compiler stability: the compiler should not crash or fail ungracefully on the
stable channel.
- Rust stability: the compiler must respect Rust's stability guarantees by not
breaking programs that previously compiled despite the many changes that are
always going on to its implementation.
- Limitations of other tools: `rustc` uses LLVM in its backend, and LLVM has some
strengths we leverage and some aspects we need to work around.

So, as you continue your journey through the rest of the guide, keep these
things in mind. They will often inform decisions that we make.
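The incremental-compilation question in the list above (what must be redone, and what can be reused?) can be illustrated with a toy sketch. Everything here is hypothetical: `IncrementalCache`, `fingerprint`, and `analyze` are names invented for this example, the "analysis" is just a word count, and hash collisions are ignored. It is not how `rustc` implements incremental compilation; it only shows the general idea of comparing a fingerprint of the input against a cached one.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical cache: item name -> (fingerprint of its input, cached result).
struct IncrementalCache {
    entries: HashMap<String, (u64, usize)>,
    /// Counts how many times the "expensive" work actually ran.
    recomputed: usize,
}

/// Hashes the input text so that changes can be detected cheaply.
fn fingerprint(source: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    source.hash(&mut hasher);
    hasher.finish()
}

impl IncrementalCache {
    fn new() -> Self {
        IncrementalCache { entries: HashMap::new(), recomputed: 0 }
    }

    /// Returns the result for `name`, redoing the work only when the input
    /// fingerprint differs from the cached one. The "analysis" is a stand-in
    /// that counts whitespace-separated words.
    fn analyze(&mut self, name: &str, source: &str) -> usize {
        let fp = fingerprint(source);
        if let Some(&(cached_fp, result)) = self.entries.get(name) {
            if cached_fp == fp {
                return result; // Input unchanged: reuse the cached result.
            }
        }
        self.recomputed += 1;
        let result = source.split_whitespace().count();
        self.entries.insert(name.to_string(), (fp, result));
        result
    }
}

fn main() {
    let mut cache = IncrementalCache::new();
    // First "compilation": both items are new, so both are computed.
    cache.analyze("foo", "fn foo() {}");
    cache.analyze("bar", "fn bar() {}");
    // Second "compilation": only `bar` was edited, so only `bar` is redone.
    cache.analyze("foo", "fn foo() {}");
    cache.analyze("bar", "fn bar() { 1 + 1; }");
    assert_eq!(cache.recomputed, 3);
    println!("work items recomputed: {}", cache.recomputed);
}
```

Even this toy version shows the trade-off mentioned in the list: every cached entry costs space and must be loaded before it can be reused, so caching more is not automatically better.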
### Intermediate representations
As with most compilers, `rustc` uses some intermediate representations (IRs) to
facilitate computations. In general, working directly with the source code is
extremely inconvenient and error-prone. Source code is designed to be
human-friendly while at the same time being unambiguous, but it's less
convenient for doing something like, say, type checking.

Instead, most compilers, including `rustc`, build some sort of IR out of the
source code which is easier to analyze. `rustc` has a few IRs, each optimized
for different purposes:
- Token stream: the lexer produces a stream of tokens directly from the source
code. This stream of tokens is easier for the parser to deal with than raw
text.
- Abstract Syntax Tree (`AST`): the abstract syntax tree is built from the stream
of tokens produced by the lexer. It represents