Representing AST phases

2023-09-02 7 min by Lukas Atkinson Rust (3)

It is common for a compiler to have multiple phases. For example, a compiler would first parse the syntax into an Abstract Syntax Tree (AST), and then actually resolve symbols such as variable names.

There are two ways to represent this in the compiler:

We can have different data structures for the resolved and unresolved version.
```
fn resolve(ast: UnresolvedAst) -> Result<Ast, Error> { ... }

enum Ast {
  Symbol(Value),
  ...
}

enum UnresolvedAst {
  Symbol(Name),
  ...  // rest is identical to Ast
}
```
This is an example of the typestate pattern. It ensures that we can never accidentally have unresolved variables when using the AST.

However, it requires lots of boilerplate code for the two nearly-equal AST data structures.
We can have a single AST data structure and modify it in-place during symbol resolution.
```
fn resolve(ast: &mut Ast) -> Result<(), Error> { ... }

// or:
fn resolve(ast: Ast) -> Result<Ast, Error> { ... }

enum Ast {
  Symbol(Symbol),
  ...
}

enum Symbol {
  Resolved(Value),
  Unresolved(Name),
}
```
This is much simpler to handle because we only have one AST data structure, but given an AST we can never be quite sure whether all symbols have already been resolved. This attracts bugs.

There are some strategies to make the "different types" approach less painful:

Defining the AST data structure via macros. This is a potentially flexible way to define multiple datastructures that follow the same general structure but differ in some details. This can also help with generating boilerplate code for converting between these representations.

However, using lots of macros tends make the code more difficult to understand and debug, since input to the Macro is typically not the full Rust syntax.
Making the AST data structure generic over relevant details. This is less powerful than using macros, but it's directly integrated into the type system and generally preferable if possible. Just as with macros, the Rust compiler will compile/specialize each instance separately, but it is now possible to implement shared methods for all instances.

However, there are some limitations to how generics can be used in Rust.

Representing generic ASTs

Let's look at generics in more detail. The core idea is to introduce a type parameter for the symbol:

enum Ast<S = ResolvedSymbol> {
  Symbol(S),
  ...
}

struct ResolvedSymbol(Value);
struct UnresolvedSymbol(Name);

fn resolve(ast: Ast<UnresolvedSymbol>) -> Ast { ... }

This works well for "leaf" values in the AST. The verbosity of the generic parameters can be erased using type aliases.

Generics in recursive data structures

However, Rust runs into difficulties when trying to make non-leaf values generic. Demonstrating this requires a more complex example, for example by introducing some expression types:

enum Ast {
  Symbol(Value),
  BinOp(Box<BinOp>),
  UnOp(Box<UnOp>),
}

enum BinOp {
  Plus(Ast, Ast),
  Minus(Ast, Ast),
}

enum UnOp {
  Minus(Ast),
}

To represent different dialects of a language, we might be tempted to make the AST generic over the BinOp and UnOp types, but this requires us to proliferate the type parameters over all other classes as well:

enum Ast<B, U> {
  Symbol(Value),
  BinOp(Box<B>),
  UnOp(Box<U>),
}

enum BinOp<U> {
  Plus(Ast<BinOp<U>, U>, Ast<BinOp<U>, U>),
  Minus(Ast<BinOp<U>, U>, Ast<BinOp<U>, U>),
}

enum UnOp<B> {
  Minus(Ast<B, UnOp<B>>),
}

Defining such types is legal, but we cannot actually instantiate them! The types BinOp<_> and UnOp<_> would be mutually recursive:

type NormalAst = Ast<BinOp<UnOp<BinOp<_>>>, UnOp<BinOp<UnOp<_>>>>;

We could also try parametrizing BinOp/UnOp over the Ast type, which reduces the generics spam:

enum Ast<B, U> {
  Symbol(Value),
  BinOp(Box<B>),
  UnOp(Box<U>),
}

enum BinOp<A> {
  Plus(A, A),
  Minus(A, A),
}

enum UnOp<A> {
  Minus(A),
}

But again, we cannot instantiate the desired type as the Ast parameters are recursive:

type NormalAst = Ast<BinOp<Ast<_>>, UnOp<Ast<_>>>;

Using a newtype as a fixpoint

In C++ or Java, a common way to get rid of such recurring generic parameters is the Curiously Recurring Template Pattern (CRTP):

template<typename T>
struct Base { ... };

struct Derived: Base<Derived> { ... };

Rust of course does not have inheritance. But the important part for the CRTP is not the inheritance, but being able to specify the Self type in a generic parameter that defines this type (which is directly relevant to the symbol resolution example this article started with…).

In Rust, we can try a newtype struct instead of the type alias:

struct NormalAst(Ast<BinOp<Self>, UnOp<Self>>);

This does work, but now requires wrapping any Ast value with the newtype constructor, which is rather annoying.

Using type families

Another way to approach this is to give the Ast not an instantiated BinOp<T> type, but a BinOp type constructor. In C++, that would involve template-template parameters:

template<
  template<typename> B,
  template<typename> U,
>
struct Ast {
  B<Ast<B, U>> binop;
  U<Ast<B, U>> unop;
};

This is an example of higher-kinded types (HKT). Rust does not directly support HKTs in generic parameters, but can encode them via associated types. But that requires that we define another type as a container for these HKTs, and a trait so that we can constrain the generics on Ast.

However, associated types already allow for a simpler solution: we can define a helper type that contains all the types we need. In C++, these kinds of helper types would be called a "trait". Here, I'll call it a "family".

enum Ast<T: AstFamily> {
  Symbol(Value),
  BinOp(Box<T::BinOp>),
  UnOp(Box<T::UnOp>),
}

enum BinOp<T: AstFamily> {
  Plus(T::Ast, T::Ast),
  Minus(T::Ast, T::Ast),
}

enum UnOp<T: AstFamily> {
  Minus(T::Ast),
}

trait AstFamily {
  type Ast: Debug;
  type BinOp: Debug;
  type UnOp: Debug;
}

struct NormalAst;

impl AstFamily for NormalAst {
  type Ast = Ast<Self>;
  type BinOp = BinOp<Self>;
  type UnOp = UnOp<Self>;
}

This too does work, and might be preferable for larger groups of interrelated types.

Sometimes, not doing any of this is simpler

The last couple of sections looked at some pretty advanced type system machinery to "elegantly" describe different variants of an AST data structure.

But this is still a lot of complexity, and sometimes it is simpler to have a single data structure that can represent all variants, even if that is a bit less type safe.

It is still possible to have a high-quality codebase, even if some states should be impossible. For example:

Use asserts (in Rust: maybe debug_assert!()) to check that the input data satisfies expected invariants, e.g. that all symbols have been resolved. Running this check over an AST will be fairly efficient, compared to allocating a completely new AST data structure.
Use property-based testing and fuzzing to find unhandled edge cases in the code. In my experience, this is highly effective at finding these kinds of bugs. However, generating mutually recursive input data structures tends to be tricky…

For an introduction to proptests see the hypothesis.works website from the Python hypothesis package. In Rust, use the proptest crate for writing such tests.

This write-up was prompted by the Software Engineering Stack Exchange question: How can I represent a transformed AST between compilation stages?

licensing	CC-BY-SA-4.0