# Parallel compilation

<div class="warning">
As of <!-- date-check --> November 2024,
the parallel front-end is undergoing significant changes,
so this page contains quite a bit of outdated information.

Tracking issue: <https://github.com/rust-lang/rust/issues/113349>
</div>

As of <!-- date-check --> November 2024, most of the Rust compiler is now
parallelized.

- The codegen part is executed concurrently by default. You can use the `-C
  codegen-units=n` option to control the number of concurrent tasks.
- The parts between HIR lowering and codegen, such as type checking, borrow
  checking, and MIR optimization, are parallelized in the nightly version.
  Currently, they are executed serially by default, and parallelization is
  enabled manually by the user with the `-Z threads=n` option (see the example
  below).
- Other parts, such as lexing and parsing, HIR lowering, and macro expansion,
  are still executed serially.
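
For example, on a nightly toolchain the parallel front-end can be enabled by
passing `-Z threads`. The invocation below is only an illustration of the flags
described above, applied to a hypothetical project:

```text
# Enable 8 front-end threads for a cargo build:
RUSTFLAGS="-Z threads=8" cargo +nightly build --release

# Or, when invoking rustc directly, also choosing the number of codegen units:
rustc +nightly -Z threads=8 -C codegen-units=16 main.rs
```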

<div class="warning">
The following sections are kept for now but are quite outdated.
</div>

---


## Code generation

During monomorphization the compiler splits up all the code to
be generated into smaller chunks called _codegen units_. These are then generated by
independent instances of LLVM running in parallel. At the end, the linker
is run to combine all the codegen units together into one binary. This process
occurs in the [`rustc_codegen_ssa::base`] module.
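
The following is not the actual rustc code, just a minimal sketch of the
pattern using hypothetical `MonoItem` and `CompiledUnit` types: split the work
into independent chunks, process each chunk on its own thread, and combine the
results at the end.

```rust
use std::thread;

// Hypothetical stand-ins for monomorphized items and compiled codegen units.
#[derive(Clone)]
struct MonoItem(String);
struct CompiledUnit(Vec<String>);

/// Split `items` into roughly `n_units` chunks and "compile" each chunk on its
/// own thread, mirroring how codegen units are handed to independent LLVM
/// instances. Afterwards the results are combined, as the linker would do.
fn codegen_in_parallel(items: Vec<MonoItem>, n_units: usize) -> Vec<CompiledUnit> {
    let chunk_size = ((items.len() + n_units - 1) / n_units).max(1);
    let chunks: Vec<Vec<MonoItem>> =
        items.chunks(chunk_size).map(|c| c.to_vec()).collect();

    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            thread::spawn(move || {
                // Placeholder for running LLVM on one codegen unit.
                CompiledUnit(chunk.into_iter().map(|MonoItem(name)| name).collect())
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let items: Vec<MonoItem> = (0..10).map(|i| MonoItem(format!("fn_{i}"))).collect();
    let units = codegen_in_parallel(items, 4);
    assert_eq!(units.iter().map(|u| u.0.len()).sum::<usize>(), 10);
}
```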


## Data structures

The underlying thread-safe data structures used in the parallel compiler
can be found in the [`rustc_data_structures::sync`] module. These data structures
are implemented differently depending on whether `parallel-compiler` is true.

| data structure   | parallel                            | non-parallel           |
| ---------------- | ----------------------------------- | ---------------------- |
| Lock\<T>         | parking_lot::Mutex\<T>              | std::cell::RefCell\<T> |
| RwLock\<T>       | parking_lot::RwLock\<T>             | std::cell::RefCell\<T> |
| MTLock\<T>       | Lock\<T>                            | T                      |
| ReadGuard        | parking_lot::RwLockReadGuard        | std::cell::Ref         |
| MappedReadGuard  | parking_lot::MappedRwLockReadGuard  | std::cell::Ref         |
| WriteGuard       | parking_lot::RwLockWriteGuard       | std::cell::RefMut      |
| MappedWriteGuard | parking_lot::MappedRwLockWriteGuard | std::cell::RefMut      |
| LockGuard        | parking_lot::MutexGuard             | std::cell::RefMut      |
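
A minimal sketch of the underlying pattern, assuming a hypothetical
`parallel_compiler` cfg flag and the `parking_lot` crate, might look like this
(the real definitions in [`rustc_data_structures::sync`] are much richer):

```rust
// Sketch only. With the parallel compiler, Lock<T> wraps a real mutex; without
// it, a plain RefCell<T> suffices because only one thread exists.
#[cfg(parallel_compiler)]
use parking_lot::Mutex as LockImpl;
#[cfg(not(parallel_compiler))]
use std::cell::RefCell as LockImpl;

pub struct Lock<T>(LockImpl<T>);

impl<T> Lock<T> {
    pub fn new(value: T) -> Self {
        Lock(LockImpl::new(value))
    }

    /// Run `f` with exclusive access to the contents.
    #[cfg(parallel_compiler)]
    pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        f(&mut *self.0.lock())
    }

    #[cfg(not(parallel_compiler))]
    pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        f(&mut *self.0.borrow_mut())
    }
}
```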

- These thread-safe data structures are used throughout compilation, which can
  cause lock contention and degrade performance as the number of threads
  increases beyond 4. We therefore audit the use of these data structures,
  which leads either to refactorings that reduce the use of shared state, or to
  persistent documentation covering the specifics of the invariants, the
  atomicity, and the lock orderings.

- On the other hand, we still need to figure out which other invariants of
  compilation might not hold under parallel compilation.


### WorkerLocal

[`WorkerLocal`] is a special data structure implemented for parallel compilers. It
holds worker-local values for each thread in a thread pool. You can only
access a worker-local value through the `Deref` impl on the thread pool it
was constructed on; it panics otherwise.

`WorkerLocal` is used to implement the `Arena` allocator in the parallel
environment, which is critical in parallel queries. Its implementation is
located in the [`rustc_data_structures::sync::worker_local`] module. However,
in the non-parallel compiler, it is implemented as `OneThread<T>`, whose `T`
can be accessed directly through `Deref::deref`.
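
A minimal sketch of the idea, with a hypothetical `current_worker_index`
helper standing in for the integration with the thread pool, might look like
this:

```rust
use std::ops::Deref;

/// Sketch of a WorkerLocal-like structure: one value per worker thread in a
/// pool of known size. The real type lives in
/// rustc_data_structures::sync::worker_local and is tied to the rayon pool.
struct WorkerLocal<T> {
    locals: Vec<T>,
}

impl<T> WorkerLocal<T> {
    /// Build one value per worker by calling `initial` with the worker index.
    fn new(workers: usize, initial: impl FnMut(usize) -> T) -> Self {
        WorkerLocal { locals: (0..workers).map(initial).collect() }
    }
}

impl<T> Deref for WorkerLocal<T> {
    type Target = T;

    fn deref(&self) -> &T {
        // Hypothetical helper: the real type asks the thread pool which worker
        // is currently running; outside the right pool it panics instead.
        &self.locals[current_worker_index()]
    }
}

fn current_worker_index() -> usize {
    0 // placeholder; a real pool would report the calling worker's index
}
```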


## Parallel iterator

The parallel iterators provided by the [`rayon`] crate are an easy way to
implement parallelism. In the current implementation of the parallel compiler
we use a custom [fork][rustc-rayon] of `rayon` to run tasks in parallel.

Some iterator functions are implemented to run loops in parallel 
when `parallel-compiler` is true.

| Function (omitting `Send` and `Sync` bounds)                 | Description                                                  | Owning Module               |
| ------------------------------------------------------------ | ------------------------------------------------------------ | --------------------------- |
| **par_iter**<T: IntoParallelIterator>(t: T) -> T::Iter       | generate a parallel iterator                                 | rustc_data_structures::sync |
| **par_for_each_in**<T: IntoParallelIterator>(t: T, for_each: impl Fn(T::Item)) | generate a parallel iterator and run `for_each` on each element | rustc_data_structures::sync |
| **Map::par_body_owners**(self, f: impl Fn(LocalDefId))       | run `f` on all HIR owners in the crate                       | rustc_middle::hir::map      |
| **Map::par_for_each_module**(self, f: impl Fn(LocalDefId))   | run `f` on all modules and sub modules in the crate          | rustc_middle::hir::map      |
| **ModuleItems::par_items**(&self, f: impl Fn(ItemId))        | run `f` on all items in the module                           | rustc_middle::hir           |
| **ModuleItems::par_trait_items**(&self, f: impl Fn(TraitItemId)) | run `f` on all trait items in the module                     | rustc_middle::hir           |
| **ModuleItems::par_impl_items**(&self, f: impl Fn(ImplItemId)) | run `f` on all impl items in the module                      | rustc_middle::hir           |
| **ModuleItems::par_foreign_items**(&self, f: impl Fn(ForeignItemId)) | run `f` on all foreign items in the module                   | rustc_middle::hir           |
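
A minimal sketch of how such a helper can fall back to a serial loop, again
assuming a hypothetical `parallel_compiler` cfg flag and the `rayon` crate,
might look like this (the real helpers also thread through panic handling and
the `rustc-rayon` fork):

```rust
#[cfg(parallel_compiler)]
use rayon::iter::{IntoParallelIterator, ParallelIterator};

/// Run `for_each` on every element: in parallel when built with the parallel
/// compiler, serially otherwise. Sketch only, not the real implementation.
#[cfg(parallel_compiler)]
pub fn par_for_each_in<T: IntoParallelIterator>(
    t: T,
    for_each: impl Fn(T::Item) + Sync + Send,
) {
    t.into_par_iter().for_each(for_each);
}

#[cfg(not(parallel_compiler))]
pub fn par_for_each_in<T: IntoIterator>(t: T, for_each: impl Fn(T::Item)) {
    for item in t {
        for_each(item);
    }
}
```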

There are a lot of loops in the compiler which can possibly be parallelized
using these functions. As of <!-- date-check--> August 2022, scenarios where
the parallel iterator function has been used are as follows:

| caller                                                  | scenario                                                     | callee                   |
| ------------------------------------------------------- | ------------------------------------------------------------ | ------------------------ |
| rustc_metadata::rmeta::encoder::prefetch_mir            | Prefetch queries which will be needed later by metadata encoding | par_iter                 |
| rustc_monomorphize::collector::collect_crate_mono_items | Collect monomorphized items reachable from non-generic items | par_for_each_in          |
| rustc_interface::passes::analysis                       | Check the validity of the match statements                   | Map::par_body_owners     |
| rustc_interface::passes::analysis                       | MIR borrow check                                             | Map::par_body_owners     |
| rustc_typeck::check::typeck_item_bodies                 | Type check                                                   | Map::par_body_owners     |
| rustc_interface::passes::hir_id_validator::check_crate  | Check the validity of the HIR                                | Map::par_for_each_module |
| rustc_interface::passes::analysis                       | Check the validity of loop bodies, attributes, naked functions, unstable ABI, const bodies | Map::par_for_each_module |
| rustc_interface::passes::analysis                       | Liveness and intrinsic checking of MIR                       | Map::par_for_each_module |
| rustc_interface::passes::analysis                       | Dead code checking                                           | Map::par_for_each_module |
| rustc_interface::passes::analysis                       | Privacy checking                                             | Map::par_for_each_module |
| rustc_lint::late::check_crate                           | Run per-module lints                                         | Map::par_for_each_module |
| rustc_typeck::check_crate                               | Well-formedness checking                                     | Map::par_for_each_module |

There are still many loops that have the potential to use parallel iterators.

## Query system

The query model has some properties that make it actually feasible to evaluate
multiple queries in parallel without too much effort:

- All data a query provider can access is via the query context, so
  the query context can take care of synchronizing access.
- Query results are required to be immutable so they can safely be used by
  different threads concurrently.

When a query `foo` is evaluated, the cache table for `foo` is locked.

- If there already is a result, we can clone it, release the lock and
  we are done.
- If there is no cache entry and no other active query invocation computing the
  same result, we mark the key as being "in progress", release the lock and
  start evaluating.
- If there *is* another query invocation for the same key in progress, we
  release the lock, and just block the thread until the other invocation has
  computed the result we are waiting for. **Cycle error detection** in the
  parallel compiler requires more complex logic than in single-threaded mode:
  when worker threads in parallel queries stop making progress because they
  depend on each other's results, the compiler uses an extra thread (the
  *deadlock handler*) to detect, remove, and report the cycle error.
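
The following is a much-simplified sketch of this per-key protocol, using
hypothetical types built on `std`'s `Mutex` and `Condvar`; the real query
system is considerably more involved and also performs the cycle detection
described above:

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Condvar, Mutex};

/// State of one query key in the cache, mirroring the cases above.
enum Entry<V> {
    InProgress,
    Done(V),
}

struct QueryCache<K, V> {
    map: Mutex<HashMap<K, Entry<V>>>,
    // Signalled whenever an in-progress entry becomes Done.
    done: Condvar,
}

impl<K: Hash + Eq + Clone, V: Clone> QueryCache<K, V> {
    fn new() -> Self {
        QueryCache { map: Mutex::new(HashMap::new()), done: Condvar::new() }
    }

    fn get_or_compute(&self, key: K, compute: impl FnOnce() -> V) -> V {
        let mut map = self.map.lock().unwrap();
        loop {
            match map.get(&key) {
                // Case 1: a cached result already exists; clone it and return.
                Some(Entry::Done(v)) => return v.clone(),
                // Case 3: another thread is computing this key; block until it
                // signals completion, then re-check the table.
                Some(Entry::InProgress) => {
                    map = self.done.wait(map).unwrap();
                }
                // Case 2: mark the key as in progress, drop the lock, compute,
                // then store the result and wake any waiting threads.
                None => {
                    map.insert(key.clone(), Entry::InProgress);
                    drop(map);
                    let value = compute();
                    let mut map = self.map.lock().unwrap();
                    map.insert(key, Entry::Done(value.clone()));
                    self.done.notify_all();
                    return value;
                }
            }
        }
    }
}
```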

The parallel query feature still has implementation work to do, most of which
is related to the previous `Data Structures` and `Parallel Iterators` sections.
See [this open feature tracking issue][tracking].

## Rustdoc

As of <!-- date-check--> November 2022, there are still a number of steps to
complete before `rustdoc` rendering can be made parallel (see an open
discussion of [parallel `rustdoc`][parallel-rustdoc]).

## Resources

Here are some resources that can be used to learn more:

- [This IRLO thread by alexcrichton about performance][irlo1]
- [This IRLO thread by Zoxc, one of the pioneers of the effort][irlo0]
- [This list of interior mutability in the compiler by nikomatsakis][imlist]

