Ragithub

# Incremental compilation

<!-- toc -->

The incremental compilation scheme is, in essence, a surprisingly
simple extension to the overall query system. We'll start by describing
a slightly simplified variant of the real thing – the "basic algorithm" –
and then describe some possible improvements.

## The basic algorithm

The basic algorithm is
called the **red-green** algorithm[^salsa]. The high-level idea is
that, after each run of the compiler, we will save the results of all
the queries that we do, as well as the **query DAG**. The
**query DAG** is a [DAG] that indexes which queries executed which
other queries. So, for example, there would be an [edge] from a query Q1
to another query Q2 if computing Q1 required computing Q2 (note that
because queries cannot depend on themselves, this results in a DAG and
not a general graph).


> **NOTE**: You might think of a query as simply the definition of a query.
> A thing that you can invoke, a bit like a function, 
> and which either returns a cached result or actually executes the code.
> 
> If that's the way you think about queries,
> it's good to know that in the following text, queries will be said to have colours. 
> Keep in mind though, that here the word query also refers to a certain invocation of 
> the query for a certain input. As you will read later, queries are fingerprinted based 
> on their arguments. The result of a query might change when we give it one argument 
> and be coloured red, while it stays the same for another argument and is thus green.
> 
> In short, the word query is here not just used to mean the definition of a query, 
> but also for a specific instance of that query with given arguments.

On the next run of the compiler, then, we can sometimes reuse these
query results to avoid re-executing a query. We do this by assigning
every query a **color**:

- If a query is colored **red**, that means that its result during
  this compilation has **changed** from the previous compilation.
- If a query is colored **green**, that means that its result is
  the **same** as the previous compilation.

There are two key insights here:

- First, if all the inputs to query Q are colored green, then the
  query Q **must** result in the same value as last time and hence
  need not be re-executed (or else the compiler is not deterministic).
- Second, even if some inputs to a query changes, it may be that it
  **still** produces the same result as the previous compilation. In
  particular, the query may only use part of its input.
  - Therefore, after executing a query, we always check whether it
    produced the same result as the previous time. **If it did,** we
    can still mark the query as green, and hence avoid re-executing
    dependent queries.

### The try-mark-green algorithm

At the core of incremental compilation is an algorithm called
"try-mark-green". It has the job of determining the color of a given
query Q (which must not have yet been executed). In cases where Q has
red inputs, determining Q's color may involve re-executing Q so that
we can compare its output, but if all of Q's inputs are green, then we
can conclude that Q must be green without re-executing it or inspecting
its value at all. In the compiler, this allows us to avoid
deserializing the result from disk when we don't need it, and in fact
enables us to sometimes skip *serializing* the result as well
(see the refinements section below).

Try-mark-green works as follows:

- First check if the query Q was executed during the previous compilation.
  - If not, we can just re-execute the query as normal, and assign it the
    color of red.
- If yes, then load the 'dependent queries' of Q.
- If there is a saved result, then we load the `reads(Q)` vector from the
  query DAG. The "reads" is the set of queries that Q executed during
  its execution.
  - For each query R in `reads(Q)`, we recursively demand the color
    of R using try-mark-green.
    - Note: it is important that we visit each node in `reads(Q)` in same order
      as they occurred in the original compilation. See [the section on the
      query DAG below](#dag).
    - If **any** of the nodes in `reads(Q)` wind up colored **red**, then Q is
      dirty.
      - We re-execute Q and compare the hash of its result to the hash of the
        result from the previous compilation.
      - If the hash has not changed, we can mark Q as **green** and return.
    - Otherwise, **all** of the nodes in `reads(Q)` must be **green**. In that
      case, we can color Q as **green** and return.

<a id="dag"></a>

### The query DAG

The query DAG code is stored in
[`compiler/rustc_middle/src/dep_graph`][dep_graph]. Construction of the DAG is done
by instrumenting the query execution.

One key point is that the query DAG also tracks ordering; that is, for
each query Q, we not only track the queries that Q reads, we track the
**order** in which they were read.  This allows try-mark-green to walk
those queries back in the same order. This is important because once a
subquery comes back as red, we can no longer be sure that Q will continue
along the same path as before. That is, imagine a query like this:

```rust,ignore
fn main_query(tcx) {
    if tcx.subquery1() {
        tcx.subquery2()
    } else {
        tcx.subquery3()
    }
}
```

Now imagine that in the first compilation, `main_query` starts by
executing `subquery1`, and this returns true. In that case, the next
query `main_query` executes will be `subquery2`, and `subquery3` will
not be executed at all.

But now imagine that in the **next** compilation, the input has
changed such that `subquery1` returns **false**. In this case, `subquery2`
would never execute. If try-mark-green were to visit `reads(main_query)` out
of order, however, it might visit `subquery2` before `subquery1`, and hence
execute it.
This can lead to ICEs and other problems in the compiler.


## Improvements to the basic algorithm

In the description of the basic algorithm, we said that at the end of
compilation we would save the results of all the queries that were
performed.  In practice, this can be quite wasteful – many of those
results are very cheap to recompute, and serializing and deserializing
them is not a particular win. In practice, what we would do is to save
**the hashes** of all the subqueries that we performed. Then, in select cases,
we **also** save the results.

This is why the incremental algorithm separates computing the
**color** of a node, which often does not require its value, from
computing the **result** of a node. Computing the result is done via a simple
algorithm like so:

- Check if a saved result for Q is available. If so, compute the color of Q.
  If Q is green, deserialize and return the saved result.
- Otherwise, execute Q.
  - We can then compare the hash of the result and color Q as green if
    it did not change.

## Resources
The initial design document can be found [here][initial-design], which expands
on the memoization details, provides more high-level overview and motivation
for this system.

# Footnotes