Overview of the Rust Compiler: Lexing, Parsing, and Initial Stages

# Overview of the compiler

This chapter is about the overall process of compiling a program -- how everything fits together.

The Rust compiler is special in two ways: it does things to your code that other compilers don't do (e.g. borrow-checking) and it has a lot of unconventional implementation choices (e.g. queries). We will talk about these in turn in this chapter, and in the rest of the guide, we will look at the individual pieces in more detail.

## What the compiler does to your code

So first, let's look at what the compiler does to your code. For now, we will avoid mentioning how the compiler implements these steps except as needed.

### Invocation

Compilation begins when a user writes a Rust source program in text and invokes the `rustc` compiler on it. The work that the compiler needs to perform is defined by command-line options. For example, it is possible to enable nightly features (`-Z` flags), perform `check`-only builds, or emit the LLVM Intermediate Representation (`LLVM-IR`) rather than executable machine code. The `rustc` executable call may be indirect through the use of `cargo`.

Command line argument parsing occurs in the [`rustc_driver`]. This crate defines the compile configuration that is requested by the user and passes it to the rest of the compilation process as a [`rustc_interface::Config`].

### Lexing and parsing

The raw Rust source text is analyzed by a low-level *lexer* located in [`rustc_lexer`]. At this stage, the source text is turned into a stream of atomic source code units known as _tokens_. The `lexer` supports the Unicode character encoding.

The token stream passes through a higher-level lexer located in [`rustc_parse`] to prepare for the next stage of the compile process. The [`Lexer`] `struct` is used at this stage to perform a set of validations and turn strings into interned symbols (_interning_ is discussed later). [String interning] is a way of storing only one immutable copy of each distinct string value.

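
As a rough illustration of the two stages described above, the following sketch splits source text into simple tokens and then interns the text of each identifier. It is a toy model and not the API of `rustc_lexer` or `rustc_parse`: the names `TokenKind`, `Token`, `tokenize`, `Symbol`, `Interner`, and `intern_identifiers` are all hypothetical, and the token rules are far simpler than Rust's (keywords, for instance, are treated as plain identifiers).

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical token kinds; a real lexer distinguishes many more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Ident,
    Number,
    Whitespace,
    Punct,
}

/// A low-level token is plain data: a kind and a length in bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Token {
    kind: TokenKind,
    len: usize,
}

/// Byte length of the longest prefix of `s` whose chars all satisfy `pred`.
fn run_len(s: &str, pred: impl Fn(char) -> bool) -> usize {
    s.char_indices().find(|&(_, c)| !pred(c)).map_or(s.len(), |(i, _)| i)
}

/// Stage 1 (toy): turn source text into a flat token stream.
fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut rest = src;
    while let Some(first) = rest.chars().next() {
        let (kind, len) = if first.is_alphabetic() || first == '_' {
            (TokenKind::Ident, run_len(rest, |c| c.is_alphanumeric() || c == '_'))
        } else if first.is_ascii_digit() {
            (TokenKind::Number, run_len(rest, |c| c.is_ascii_digit()))
        } else if first.is_whitespace() {
            (TokenKind::Whitespace, run_len(rest, char::is_whitespace))
        } else {
            // Every other character becomes a one-character punctuation token.
            (TokenKind::Punct, first.len_utf8())
        };
        tokens.push(Token { kind, len });
        rest = &rest[len..];
    }
    tokens
}

/// A small integer handle standing in for an interned string.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Symbol(u32);

/// Stores one shared, immutable copy of each distinct string.
#[derive(Default)]
struct Interner {
    ids: HashMap<Rc<str>, Symbol>,
    strings: Vec<Rc<str>>,
}

impl Interner {
    /// Return the existing symbol for `text`, or allocate a new one.
    fn intern(&mut self, text: &str) -> Symbol {
        if let Some(&sym) = self.ids.get(text) {
            return sym;
        }
        let sym = Symbol(self.strings.len() as u32);
        let shared: Rc<str> = Rc::from(text);
        self.ids.insert(Rc::clone(&shared), sym);
        self.strings.push(shared);
        sym
    }

    /// Look up the text behind a symbol.
    fn resolve(&self, sym: Symbol) -> &str {
        &self.strings[sym.0 as usize]
    }
}

/// Stage 2 (toy): walk the tokens and intern the text of every identifier.
fn intern_identifiers(src: &str, interner: &mut Interner) -> Vec<Symbol> {
    let mut pos = 0;
    let mut symbols = Vec::new();
    for tok in tokenize(src) {
        let text = &src[pos..pos + tok.len];
        pos += tok.len;
        if tok.kind == TokenKind::Ident {
            symbols.push(interner.intern(text));
        }
    }
    symbols
}
```

Calling `intern_identifiers("let x = x + 42;", &mut interner)` on a fresh `Interner` yields three symbols, and the two occurrences of `x` share the same one. Comparing two identifiers then reduces to comparing two integers, which is the main appeal of interning.
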
The lexer has a small interface and doesn't depend directly on the diagnostic infrastructure in `rustc`. Instead it provides diagnostics as plain data which are emitted in [`rustc_parse::lexer`] as real diagnostics. The `lexer` preserves full fidelity information for both IDEs and procedural macros (sometimes referred to as "proc-macros").

The *parser* [translates the token stream from the `lexer` into an Abstract Syntax Tree (AST)][parser]. It uses a recursive descent (top-down) approach to syntax analysis. The crate entry points for the `parser` are the [`Parser::parse_crate_mod()`][parse_crate_mod] and [`Parser::parse_mod()`][parse_mod] methods found in [`rustc_parse::parser::Parser`]. The external module parsing entry point is [`rustc_expand::module::parse_external_mod`][parse_external_mod]. And the macro-`parser` entry point is [`Parser::parse_nonterminal()`][parse_nonterminal].

Parsing is performed with a set of [`parser`] utility methods including [`bump`], [`check`], [`eat`], [`expect`], [`look_ahead`].

Parsing is organized by semantic construct. Separate `parse_*` methods can be found in the [`rustc_parse`][rustc_parse_parser_dir] directory. The source file name follows the construct name. For example, the following files are found in the `parser`:

- [`expr.rs`](https://github.com/rust-lang/rust/blob/master/compiler/rustc_parse/src/parser/expr.rs)
- [`pat.rs`](https://github.com/rust-lang/rust/blob/master/compiler/rustc_parse/src/parser/pat.rs)
- [`ty.rs`](https://github.com/rust-lang/rust/blob/master/compiler/rustc_parse/src/parser/ty.rs)
- [`stmt.rs`](https://github.com/rust-lang/rust/blob/master/compiler/rustc_parse/src/parser/stmt.rs)

This naming scheme is used across many compiler stages. You will find either a file or directory with the same name across the parsing, lowering, type checking, [Typed High-level Intermediate Representation (`THIR`)][thir] lowering, and [Mid-level Intermediate Representation (`MIR`)][mir] building sources.

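
To make the recursive descent approach and the role of the utility methods more concrete, here is a minimal sketch of a parser for arithmetic expressions. It borrows the method names `bump`, `check`, `eat`, `expect`, and `look_ahead` from the text above, but the signatures and behavior shown are simplified assumptions and do not match the real `rustc_parse::parser::Parser`. The types `Tok`, `Expr`, and `ToyParser` and the function `parse` are hypothetical.

```rust
/// Hypothetical token type for a tiny expression language.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Num(i64),
    Plus,
    Star,
    LParen,
    RParen,
    Eof,
}

/// A minimal AST: numbers and binary operations.
#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Binary(Box<Expr>, char, Box<Expr>),
}

struct ToyParser {
    tokens: Vec<Tok>,
    pos: usize,
}

impl ToyParser {
    fn new(mut tokens: Vec<Tok>) -> Self {
        // A trailing `Eof` means there is always a current token.
        tokens.push(Tok::Eof);
        Self { tokens, pos: 0 }
    }

    /// Peek `n` tokens ahead without consuming anything.
    fn look_ahead(&self, n: usize) -> &Tok {
        self.tokens.get(self.pos + n).unwrap_or(&Tok::Eof)
    }

    /// Is the current token equal to `t`?
    fn check(&self, t: &Tok) -> bool {
        self.look_ahead(0) == t
    }

    /// Consume and return the current token (never advancing past `Eof`).
    fn bump(&mut self) -> Tok {
        let current = self.look_ahead(0).clone();
        if self.pos + 1 < self.tokens.len() {
            self.pos += 1;
        }
        current
    }

    /// Consume the current token only if it equals `t`.
    fn eat(&mut self, t: &Tok) -> bool {
        if self.check(t) {
            self.bump();
            true
        } else {
            false
        }
    }

    /// Like `eat`, but a mismatch is an error.
    fn expect(&mut self, t: &Tok) -> Result<(), String> {
        if self.eat(t) {
            Ok(())
        } else {
            Err(format!("expected {:?}, found {:?}", t, self.look_ahead(0)))
        }
    }

    /// expr := term ('+' term)*
    fn parse_expr(&mut self) -> Result<Expr, String> {
        let mut lhs = self.parse_term()?;
        while self.eat(&Tok::Plus) {
            let rhs = self.parse_term()?;
            lhs = Expr::Binary(Box::new(lhs), '+', Box::new(rhs));
        }
        Ok(lhs)
    }

    /// term := atom ('*' atom)*
    fn parse_term(&mut self) -> Result<Expr, String> {
        let mut lhs = self.parse_atom()?;
        while self.eat(&Tok::Star) {
            let rhs = self.parse_atom()?;
            lhs = Expr::Binary(Box::new(lhs), '*', Box::new(rhs));
        }
        Ok(lhs)
    }

    /// atom := NUM | '(' expr ')'
    fn parse_atom(&mut self) -> Result<Expr, String> {
        match self.bump() {
            Tok::Num(n) => Ok(Expr::Num(n)),
            Tok::LParen => {
                let inner = self.parse_expr()?;
                self.expect(&Tok::RParen)?;
                Ok(inner)
            }
            other => Err(format!("unexpected token {:?}", other)),
        }
    }
}

/// Parse a whole token stream into a single expression.
fn parse(tokens: Vec<Tok>) -> Result<Expr, String> {
    let mut parser = ToyParser::new(tokens);
    let expr = parser.parse_expr()?;
    parser.expect(&Tok::Eof)?;
    Ok(expr)
}
```

Each grammar rule maps to one `parse_*` method, and the call structure mirrors the grammar: `parse_expr` calls `parse_term`, which calls `parse_atom`, which recurses into `parse_expr` for parenthesized input. That nesting is what makes `*` bind tighter than `+` in the resulting tree.
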
Macro-expansion, `AST`-validation, name-resolution, and early linting also take

This section introduces the overall compilation process of the Rust compiler, highlighting its unique features and unconventional implementation choices. It details the initial stages, including invocation, lexing (converting source text into tokens), and parsing (translating tokens into an Abstract Syntax Tree or AST). The organization of parsing by semantic construct is also discussed, with examples of files handling expressions, patterns, types, and statements.