RFC: Support Incremental Compilation #594

Status: closed, unmerged. Proposed file: `text/0000-incremental-compilation.md` (339 additions).
- Start Date: 2015-01-18
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary
This RFC proposes an incremental compilation strategy for `rustc` that allows for translation, codegen, and parts of static analysis to be done in an incremental fashion, without precluding the option of later expanding incrementality to parsing, macro expansion, and resolution.

In the C world, source code is split into source files and header files, where source files are the unit of (re)compilation and header files contain the 'interface' information needed to compile source files independently of each other. This RFC proposes an algorithm that can be described as (figuratively) splitting a Rust codebase into a set of virtual source files, computing the minimal 'header/interface' information for each virtual source file, and then using this information to determine whether the object code for a given virtual source file (cached from a previous compiler invocation) can safely be re-used without recompiling it.
(Note: this is just a metaphor; no actual header or source files are generated.)

# Motivation
At the moment `rustc` takes a long time to compile anything but trivial programs.

# Detailed design
To make compilation incremental, the compiler must be changed in two ways:

1. The compilation process must produce artifacts that can be (re)compiled and cached independently of each other.
2. The compiler must track dependencies between the items in a program, so it can infer which artifacts to recompile and which to re-use from a previous compilation.

## Independent Compilation Artifacts
In order for the compiler to re-use compilation artifacts from previous runs, these artifacts must be as independent of each other as possible. I propose making the unit of (re)compilation individual functions and globals, that is, everything that ends up as a symbol in the output binary. At least the LLVM IR and the object code of each function and global are cached, so the cache has two entries for every symbol: one containing object code, one containing LLVM IR.

Why object code? Because it is the final output of the compiler and can be passed immediately to the linker without having to re-run the costliest compiler passes: trans and codegen.

Why LLVM IR? Because if a fully optimized crate build is needed, it allows the compiler to generate an LLVM IR version of the crate without re-running parts of type-checking and trans. The individual IR fragments can be concatenated, and LLVM can then run its optimization passes over the whole crate.

It might also make sense to cache other things between compiler runs (e.g. important lookup tables), but the two above are certainly the most important ones. I will refer to the cached data for a symbol as a `compilation artifact` in the remainder of this document.
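
As a rough sketch, a per-symbol cache entry could then look like this (the type names are hypothetical, not existing `rustc` types):

```rust
/// Hypothetical sketch of a per-symbol cache entry.
struct Fingerprint(u64);

struct CompilationArtifact {
    /// Fingerprint of the source the artifact was built from,
    /// used to detect changes on the next compiler run.
    fingerprint: Fingerprint,
    /// Final machine code, ready to be handed to the linker.
    object_code: Vec<u8>,
    /// Unoptimized LLVM IR, kept for fully optimized crate builds.
    llvm_ir: Vec<u8>,
}
```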

## Dependency Tracking
First, let's define some terms that are useful for talking about the topic:

- A `program item` is any kind of function, type, or global variable definition. This includes functions, types, and statics defined locally within other functions, as well as associated type assignments in `impl` blocks. It does not include modules, local variables, statements, or expressions.
- A `program item interface` is the part of a `program item` that may be relevant for compiling *other* `program items`. Types are 'all interface', while for functions it is only the signature, and for statics their type.
- A `program item body` is the full definition of a `program item`: for a function, the whole definition including the function body; for a global, the whole definition including the initializer.
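
For example (the `Rectangle` type is purely illustrative):

```rust
struct Rectangle {  // a type: 'all interface'
    width: u32,
    height: u32,
}

// Interface of `area`: just the signature `fn area(r: Rectangle) -> u32`.
// Body of `area`: the whole definition below, including the block.
fn area(r: Rectangle) -> u32 {
    r.width * r.height
}
```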

`Program items` can *depend* on each other. Let's define what that means more formally:

- A program item `A` *depends* on a program item `B` iff a change to `B` means that `A` needs to be re-compiled.

With these terms in place we can state that compiling a single `program item` depends

1. on its own `program item body`, and
2. on the `program item interfaces` of any `program items` (types, traits, functions, methods, globals) it **transitively** references

The dependency structure of a `program item` can be modeled as a directed graph:

* For the `program item body` there is a node in the graph. Note that `program items` that don't have a body (structs, enums, traits, ...) are only represented by their 'interface node' in the graph.
* For each `program item interface` there is one node in the graph.
* If a `program item interface` `A` directly references another `program item` `B` then there is an edge from `A`'s graph node to `B`'s interface node. That is, whenever the name of a type, function, or global occurs, there is an edge to the interface node of that type, function, or global.
* If a `program item body` `A` references another `program item` `B` then there is an edge from `A`'s graph node to `B`'s interface node.
* There is always an edge from a `program item`'s body node to its interface node.

Let's illustrate this with an example:

```rust
struct Kid {
    name: &'static str
}

struct Tiger {
    name: &'static str
}

struct Dinosaur {
    name: &'static str,
    stomach_contents: Gastropod
}

struct Gastropod {
    name: &'static str,
    height_in_stories: u32
}

fn transmogrify(kid: Kid) -> Tiger {
    let intermediate_dinosaur = trans_internal(kid);
    Tiger { name: intermediate_dinosaur.name }
}

fn trans_internal(kid: Kid) -> Dinosaur {
    Dinosaur {
        name: kid.name,
        stomach_contents: Gastropod {
            name: "Larry", // a &'static str, so no allocation needed
            height_in_stories: 500
        }
    }
}

fn main() {
    let calvin = Kid { name: "Calvin" };
    let hobbes = Tiger { name: "Hobbes" };

    let calvin = transmogrify(calvin);

    assert!(calvin.name < hobbes.name); // compare names; Tiger has no ordering
}
```

The dependency graph of the above program looks like this:

```
          Gastropod <--- Dinosaur      +--> Kid           Tiger
              ^             ^          |     ^              ^
              |             |          |     |              |
              |             |          |     |              |
              |             |          |     |              |
              |      trans_internal ---+   transmogrify ----+    main
INTERFACES    |           ^   ^            ^   ^                 ^
-----------   |           |   |            |   |                 |
BODIES        |           |   +-------+    |   +---------------+ |
              |           |           |    |                   | |
             trans_internal'          transmogrify'          main'

```

Note that, for readability, I've omitted some redundant edges in the above graph. An edge between two nodes can be omitted if there is another path between those nodes (as in `trans_internal' --> Kid`). More formally, two dependency graphs are equivalent for our purposes if their transitive closures are equal.

This dependency graph can be queried to find all `program item interfaces` that a `program item` `A` depends on: start at the node corresponding to `A` and collect all transitively reachable nodes. In the above example this means that `main()` needs to be recompiled if `Kid`, `Tiger`, or the interface of `transmogrify` changes (because these are reachable from `main`'s body node), but not if `Dinosaur`, `Gastropod`, or any of the other functions' bodies change (because their nodes are not reachable from `main`'s body).
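
The query itself is an ordinary graph traversal. A minimal sketch, assuming a simple adjacency-list representation (all names here are hypothetical):

```rust
use std::collections::{HashMap, HashSet};

type NodeId = u32;

/// Hypothetical sketch of the dependency graph as an adjacency list;
/// edges point at the interface nodes an item depends on.
struct DepGraph {
    edges: HashMap<NodeId, Vec<NodeId>>,
}

impl DepGraph {
    /// Collect every node transitively reachable from `start`,
    /// i.e. every interface the item depends on.
    fn dependencies_of(&self, start: NodeId) -> HashSet<NodeId> {
        let mut reachable = HashSet::new();
        let mut stack = vec![start];
        while let Some(node) = stack.pop() {
            if reachable.insert(node) {
                if let Some(succs) = self.edges.get(&node) {
                    stack.extend(succs.iter().copied());
                }
            }
        }
        reachable.remove(&start); // the item itself is not a dependency
        reachable
    }
}
```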

## Generic Program Items
So far the dependency graph has only described non-generic `program items`. For generic definitions the situation is a bit more complicated, since their dependencies are only fully defined once concrete arguments have been substituted for all type parameters. Consider the following example:

```rust
trait Transmogrifiable<T> {
    fn transmogrify(self) -> T;
}

impl Transmogrifiable<Tiger> for Kid {
    fn transmogrify(self) -> Tiger {
        Tiger { name: self.name }
    }
}

impl Transmogrifiable<Gastropod> for Dinosaur {
    fn transmogrify(self) -> Gastropod {
        self.stomach_contents
    }
}

fn transmogrify<TFrom, TTo>(val: TFrom) -> TTo
    where TFrom: Transmogrifiable<TTo>
{
    val.transmogrify()
}
```

In this example `transmogrify<Kid, Tiger>` will have a different dependency graph than `transmogrify<Dinosaur, Gastropod>`. In other words, the monomorphized implementation of `transmogrify<Kid, Tiger>` is not affected if the definition of `Dinosaur` or `Gastropod` changes and the dependency graph should reflect this.

One way to model this behavior is to create, for generic `program items`, not dependency graph nodes but `node templates`, which---like generic items---have type parameters and yield a concrete, monomorphic dependency graph node once concrete arguments are substituted for all type parameters. When the need arises to check whether a particular monomorphized function implementation from the cache can be re-used, the dependency graph for the function can be constructed on demand from the given `node template` and the parameter substitutions.

**Review comment:**

If you explicitly stored the list of dependencies of each cache item in the cache, then it wouldn't be necessary to construct anything on demand - whether generic or not, you wouldn't have to go through the function to resolve types and such to determine what it actually depends on, which sounds like an improvement.

In any case, if you're caching object code, you need to store the list of functions that were inlined or otherwise had their behavior consulted by LLVM for codegen of the current function, or else you have to recompile a function whenever any of its transitive dependencies change in any way. (Unless, based on one of the notes below, you're giving up on combining optimization and incrementality? I guess it doesn't have to work in the initial version.)

Other than that I don't have much to say, but I will be eagerly watching any work on this, because I really hate slow compiles. :p

**Reply (Member Author):**

Re storing explicit lists of dependencies:
I've thought about that too, and in the end I think it doesn't make that much of a difference. It might well be that an actual implementation would do it that way. However, I for one have learned a lot about the problem at hand by trying to come up with a formal model for describing dependencies within a Rust program.

Re inlining:
Yes, that is a problem. From a theoretical point of view it's not that hard: you just add an edge to the body node of the inlined function in the dependency graph. The hard part is finding out what will be inlined (or what was inlined, if you are "recording" dependencies), because that's all LLVM turf.
But maybe it's not a big problem, because you only cache unoptimized LLVM IR anyway, and for object code no inlining is possible, since implementations are not available. Special support could be added for `#[inline(always)]` functions. If you want a more optimized build, you would have to use the granularity optimization from the 'Miscellaneous' section, and there you could be more conservative with adding dependency edges, so you would catch inlining occurrences anyway. But that's definitely an interesting point that would need a lot of testing to get right, I think.


For consistency it also makes sense to treat non-generic `program items` as generic `program items` that happen to have zero type parameters. We thus obtain a direct correspondence between `program items` and `node templates` on the one hand and monomorphized `program item` instances and dependency graph nodes on the other.
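
A hedged sketch of what such a `node template` could look like as a data structure (all names hypothetical): the template records dependencies partly in terms of its own type parameters, and substituting concrete arguments yields the dependency edges of one monomorphic node.

```rust
/// Hypothetical sketch of a node template's dependency list.
enum DepRef {
    /// A dependency on a concrete item, e.g. `Kid`.
    Concrete(String),
    /// A dependency on the n-th type parameter, e.g. `TFrom`.
    Param(usize),
}

struct NodeTemplate {
    deps: Vec<DepRef>,
}

impl NodeTemplate {
    /// Substitute concrete arguments for the type parameters,
    /// yielding the dependencies of one monomorphic instance.
    fn instantiate(&self, args: &[String]) -> Vec<String> {
        self.deps
            .iter()
            .map(|dep| match dep {
                DepRef::Concrete(name) => name.clone(),
                DepRef::Param(i) => args[*i].clone(),
            })
            .collect()
    }
}
```

For `transmogrify<TFrom, TTo>` the template's dependency list would be `[Param(0), Param(1)]`, so instantiating it with `Kid` and `Tiger` yields edges to those two types, while instantiating it with `Dinosaur` and `Gastropod` yields the other graph.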

## Program Item and Compilation Artifact Identifiers
`Program items` and their corresponding `node templates` must be identifiable in a way that is stable across multiple invocations of the compiler. Unfortunately, the `DefId` type that is currently used for cross-crate item identification does not meet this requirement, since it contains the rather unstable AST `NodeId`. Adding a single AST node can currently invalidate all `NodeId`s in the codebase, due to the sequential node ID assignment strategy.

One straightforward way of creating stable identifiers would be to use the `program item`'s path within the AST, while generating local integer IDs for anonymous blocks:

```rust
// ID: "M1"
mod M1 {

    // ID: "M1::f1"
    fn f1() {

        // ID: "M1::f1::S1"
        struct S1;

        {
            // ID: "M1::f1::0::S2"
            struct S2;
        }

        {
            // ID: "M1::f1::1::E1"
            enum E1 {}
        }
    }

    impl SomeTrait for SomeType {
        // ID: "<SomeType as M1::SomeTrait>::something"
        fn something() {
            // ...
        }
    }
}
```

The dependency tracking system as described above contains `node templates` for `program item` definitions on a syntactic level, that is, for each `struct`, `enum`, `type`, and `trait` there is one `node template`, and for each `fn`, `static`, and `const` there are two (one for the interface, one for the body). However, as seen in the section on generics, the codebase can refer to monomorphized instances of program items that cannot be identified by a single identifier as described above. A reference like `Option<String>` is a composite of multiple `program item` IDs, a tree of program item IDs in the general case, as the sketch below shows.

**Review comment:**

On the subject of monomorphized identifiers: you'll probably need to do something about symbol naming for monomorphizations of functions. Right now the name includes the hash of the pointers to the Tys representing the type arguments (which is random, thanks to ASLR). This does fine at preventing collisions, but it means you'll need to either record the mapping of (polymorphic function, type arguments) -> (symbol name) for use in later incremental builds, or fix symbol naming to produce something consistent. I tried to do the latter, but it wound up being a little more complicated than I expected (ADT Tys reference the struct/enum definition by its DefId, which is not stable) and I don't remember if I ever got it working.

**Reply (Member Author):**

Yes, that's a problem. I'd probably try to find a more stable symbol naming scheme.


```
Option<Result<u32, GenericError<String>>>:

           Option
              |
           Result
           /    \
        u32    GenericError
                    |
                 String
```

Incidentally, it is also such a composite ID that is needed for identifying `compilation artifacts` within the cache, since the cache only stores monomorphized instances of generic functions. For want of a better name I'll call it `mono-id` as in "identifier for a monomorphized type, function, trait, ...". Non-generic `program items` are just a special case of generic ones and need no special treatment.
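
A sketch of what a `mono-id` could look like as a data structure (names hypothetical, item paths abbreviated):

```rust
/// Hypothetical sketch: a mono-id is a tree of program item IDs.
struct ProgramItemId(String);

struct MonoId {
    definition: ProgramItemId,
    type_arguments: Vec<MonoId>,
}

fn leaf(path: &str) -> MonoId {
    MonoId { definition: ProgramItemId(path.to_string()), type_arguments: vec![] }
}

/// `Option<Result<u32, GenericError<String>>>` from the tree above.
fn example_mono_id() -> MonoId {
    MonoId {
        definition: ProgramItemId("Option".to_string()),
        type_arguments: vec![MonoId {
            definition: ProgramItemId("Result".to_string()),
            type_arguments: vec![
                leaf("u32"),
                MonoId {
                    definition: ProgramItemId("GenericError".to_string()),
                    type_arguments: vec![leaf("String")],
                },
            ],
        }],
    }
}
```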

## Determining If A Cache Entry Is Still Valid
This is the central question we want answered by the dependency tracking system: given the need to include the object code of some (monomorphized) `program item` in the output binary, can we just reuse the object code already stored in the cache? The question can be answered by constructing the monomorphized dependency graph from the given `mono-id` and checking each node for source code changes between the cached and the current version. In pseudo-rust:

```rust
// Pseudo-rust: `Map`, `Fingerprint`, and `Cache` stand in for
// suitable concrete types.

trait MonoId {
    fn definition_id(&self) -> ProgramItemId;
    fn type_arguments(&self) -> &Map<TypeParam, MonoId>;
}

trait NodeTemplate {
    // A fingerprint of the relevant parts of the resolved AST
    fn description(&self) -> Fingerprint;
    fn instantiate(&self, type_arguments: &Map<TypeParam, MonoId>) -> &Node;
}

trait Node {
    fn dependencies(&self) -> &[MonoId];
}

fn has_interface_changed(
    id: MonoId,
    interface_node_templates: &Map<ProgramItemId, NodeTemplate>,
    cache: &Cache)
 -> bool {

    let node_template = interface_node_templates.get(id.definition_id());

    if node_template.description() !=
       cache.load_interface_description(id.definition_id()) {
        return true;
    }

    let node = node_template.instantiate(id.type_arguments());
    for dependency_id in node.dependencies() {
        if has_interface_changed(dependency_id, interface_node_templates, cache) {
            return true;
        }
    }

    return false;
}

fn is_cached_implementation_still_valid(
    id: MonoId,
    interface_node_templates: &Map<ProgramItemId, NodeTemplate>,
    body_node_templates: &Map<ProgramItemId, NodeTemplate>,
    cache: &Cache)
 -> bool {

    let node_template = body_node_templates.get(id.definition_id());

    if node_template.description() !=
       cache.load_body_description(id.definition_id()) {
        return false;
    }

    let node = node_template.instantiate(id.type_arguments());

    for dependency_id in node.dependencies() {
        if has_interface_changed(dependency_id, interface_node_templates, cache) {
            return false;
        }
    }

    return true;
}

// Handling of cycles in the dependency graph has been omitted for clarity.
```


## Incremental Compilation Algorithm
With these things in place, we can construct an algorithm for building a crate incrementally.

```
(1) Parse, expand, and resolve the whole crate
(2) Run type inference on all program items
(3) Build the NodeTemplate map
(4) Add all monomorphic program items to the set of needed
    compilation artifacts
(5) Ensure that all needed compilation artifacts are in the cache:
    (a) Take a mono-id from the list of needed compilation artifacts
    (b) Construct its dependency graph to determine whether the
        cache entry can be reused
    (c) If not, compile the item
    (d) Add the item's fingerprint and compilation artifacts to the
        cache
    (e) If new monomorphized item instances are discovered during
        compilation, add them to the list of needed compilation
        artifacts
(6) Clear unreferenced entries from the cache
(7) Create the output binary:
    (a) Link cached object files into the output binary, or
    (b) Concatenate cached IR fragments, run optimization and
        codegen, then link
```
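
Step (5) is essentially a work-list algorithm. A self-contained sketch with stand-in types (`compile_item` and the cache-validity check are reduced to stubs):

```rust
use std::collections::HashMap;

// Minimal stand-in types for this sketch.
type MonoId = String;
struct Artifact { object_code: Vec<u8> }
struct Cache { entries: HashMap<MonoId, Artifact> }

impl Cache {
    fn is_still_valid(&self, id: &MonoId) -> bool {
        // A real implementation would do the fingerprint and
        // dependency-graph check described above.
        self.entries.contains_key(id)
    }
    fn store(&mut self, id: MonoId, artifact: Artifact) {
        self.entries.insert(id, artifact);
    }
}

// Stand-in: compiling an item yields its artifact plus any newly
// discovered monomorphized instances it references.
fn compile_item(_id: &MonoId) -> (Artifact, Vec<MonoId>) {
    (Artifact { object_code: vec![] }, vec![])
}

/// Step (5) as a work list.
fn ensure_artifacts(mut work_list: Vec<MonoId>, cache: &mut Cache) {
    while let Some(id) = work_list.pop() {
        if cache.is_still_valid(&id) {
            continue; // re-use cached object code and LLVM IR
        }
        let (artifact, new_instances) = compile_item(&id);
        work_list.extend(new_instances); // step (5e)
        cache.store(id, artifact);
    }
}
```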


## Miscellaneous Aspects

### More fine-grained dependency tracking
It would be possible to make dependency tracking aware of the kind of reference one item makes to another. If an item `A` mentions another item `B` only via some reference type (e.g. `&T`), then `A` only needs to be updated if `B` is removed or `B` changes its 'sized-ness'. This is comparable to how forward declarations are handled in C. In the dependency graph this would mean having different kinds of edges that trigger on different kinds of changes to items, as the sketch below illustrates.
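
A hedged sketch of what such edge kinds could look like (names hypothetical):

```rust
/// Hypothetical sketch: edges carry the kind of dependency, and a
/// change only propagates along edges that trigger on it.
enum EdgeKind {
    /// Item is used by value: any change to the target matters.
    Full,
    /// Item is only mentioned behind a pointer (`&T`, `*const T`):
    /// only removal or a change in sized-ness matters.
    ByReference,
}

enum Change {
    Removed,
    SizednessChanged,
    OtherInterfaceChange,
}

fn edge_triggers(kind: &EdgeKind, change: &Change) -> bool {
    match (kind, change) {
        (EdgeKind::Full, _) => true,
        (EdgeKind::ByReference, Change::Removed) => true,
        (EdgeKind::ByReference, Change::SizednessChanged) => true,
        (EdgeKind::ByReference, Change::OtherInterfaceChange) => false,
    }
}
```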

### Global Switches Influencing Codegen
There are many compiler flags that change what the generated code looks like, e.g. optimization and debuginfo levels. A simple strategy to deal with this would be to store the set of compiler flags used for building the cache and to clear the cache completely whenever a different set of flags is used. Another option is to keep multiple caches, one for each set of compiler flags (e.g. keeping both a 'debug build cache' and a 'release build cache' on disk).

**Review comment:**

Hash the relevant flags for the subdir name? I'd expect a lot of -C options affect the cache, and only storing one set wouldn't help at all for some usage patterns.

**Reply (Member Author):**

Yeah, something like that. I'd like to see how big such a cache gets.
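
A minimal sketch of the hash-the-flags idea from the exchange above; a real implementation would want a hash that is stable across compiler versions, so `DefaultHasher` is only a stand-in:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Derive a cache subdirectory name from the codegen-relevant flags,
/// so e.g. debug and release builds keep separate caches.
fn cache_dir(output_dir: &Path, relevant_flags: &[String]) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    for flag in relevant_flags {
        flag.hash(&mut hasher);
    }
    output_dir.join(format!("incr-cache-{:016x}", hasher.finish()))
}
```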


### Source Location Encoding
The current `codemap` implementation in `rustc` stores everything in one global 'address space'. This is unsuited for incremental compilation, since old versions of source files need to be removed from the codemap and new source files need to be added. This leads to a problem similar to that of the sequentially assigned AST node IDs.

It would be better to store span information as byte offsets relative to a given source file. Then the codemap can be modified without invalidating existing source locations stored somewhere in the cache.
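
A sketch of such file-relative spans (names hypothetical):

```rust
/// Hypothetical sketch: spans relative to a source file instead of
/// offsets into one global codemap address space.
struct FileId(u32);

struct Span {
    file: FileId,
    /// Byte offsets within `file` only; adding or removing *other*
    /// files leaves these values valid.
    lo: u32,
    hi: u32,
}
```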

### Physical Cache Structure
The cache could be kept somewhere within the output directory. Object code could be maintained in an `ar` archive so the linker can directly access it. Something similar might be possible for LLVM bitcode.

### Automatic inter-function optimizations
It should not be too hard to let the compiler keep track of which parts of the program change infrequently and then let it speculatively build object files with more than one function in them. For these aggregate object files inter-function LLVM optimizations could then be enabled, yielding faster object code at little additional cost. Other strategies for controlling cache granularity can be implemented in a similar fashion.

### Parallelization
If some care is taken in implementing the above concepts it should be rather easy to do translation and codegen in parallel for all items, since by design we already have (or can deterministically compute) all the information we need.

**Review comment (Member):**

We already can do codegen in parallel, although there is a bug preventing most use at the moment.


# Drawbacks

Implementing this will require a lot of work on the compiler architecture, but that is work that will be needed sooner or later anyway.

# Alternatives

I'd definitely like to hear about them.

**Review comment (Member):**

An alternative I have been thinking about as a long term solution is full incremental compilation (as opposed to incremental codegen) where we could compile a single span, from parsing all the way to codegen. This is more useful for IDEs and similar tools, but would also give us more scope to incrementalise normal compilation. I envisaged generating much more thorough metadata for crates, enough that, for example, when a function is modified, that function could be compiled independently of the rest of the crate. This would require having all the type information we currently use for type checking in the metadata. (We would of course also need the object files and so forth used for incremental codegen).

The only way for this to be sane to implement would be if we had better representations of the compiler's intermediate representations, and these could be serialised to make the metadata, rather than the current ad hoc approach (but this seems like a win from an architectural point of view too).

**Reply (Member Author):**

I definitely think of this RFC as just a first step towards an architecture that does even more things incrementally. For example, the concept of 'interfaces' that I use in the RFC seems very fruitful to me in terms of thinking about what is really needed where, and then using that understanding to improve the whole compilation process. Once you have extracted these 'interfaces', almost everything else should be doable at a per-item level and thus in parallel (not just codegen, but also type checking and many other parts of static analysis).

Anyway, I'd regard a fully incremental solution as the long-term goal too.

**Review comment (Contributor):**

In the long run, both rustc and cargo will benefit from a general-purpose caching and dependency management framework. I'm inclined to go big or go home on these things, so perhaps that could be developed separately as a library from the get-go? Between this proposal and the new IR ones, it sounds like rustc is basically going to be rewritten.


# Unresolved questions

## Dependency Graph Construction before Type Inference
I'm not sure whether it would be possible to construct valid dependency graphs *before* type inference, or whether that would miss some dependency edges. More generally: how much per-item work can be deferred until after the cache has been consulted?

**Review comment (Member):**

@epdtry found that he needed type information for constructing the dependency graph, although I don't recall why, exactly.

**Review comment:**

As I recall, type information wasn't strictly required, but it let the analysis obtain more precise dependency information for calls to trait methods. If a function contains `x + y`, knowing the types of `x` and `y` lets you find the precise implementation of `add` that's being called. Without type information, you must conservatively assume that it could be a call to any `add` implementation in scope.

I think this design would have less trouble operating without type information because (correct me if I'm wrong) the `+` would constitute a reference to the generic `Add::add` interface, not a reference to any specific implementation body. My design did not distinguish bodies from interfaces because inlining can happen anywhere, causing the body of one function to depend on another.

**Reply (Member Author):**

Thanks for the comment, @epdtry!

It occurred to me that the node template graph, together with the set of visible traits, forms the dependency graph for type inference. So, while it may not be possible to accurately compute the dependencies of a function body without doing type inference, it should be possible to cache type inference results just like other compilation artifacts.


## RLIB/DYLIB metadata
I have not investigated how library metadata will be affected by this. I guess it must be made 'linkable' in some way or other.

## Debuginfo Redundancies
There might be a lot of debuginfo redundancy in cache entries, because type debuginfo will be duplicated for each function that transitively refers to the type. This might use up a lot of disk space and make things slower than they need to be.

A lot more of these questions will probably pop up during implementation.