Transforming Syntax

Or: how to write the easy part of a compiler

(18 min read)

A Stack Overflow question asked how to translate a VB-like conditional into a C-like ternary. The other answers suggested regexes or treating it as Perl code *shudder*. But transpiling code to another language can be done correctly.

This post aims to cover:

  • parsing with Marpa::R2,
  • AST manipulation,
  • optimization passes,
  • compilation, and
  • Perl OO.

In the end, we’ll be able to do all that in only 200 lines of code!

Since this post is already rather long, we will not discuss parsing theory. You are expected to be familiar with EBNF grammar notation.

Note: All runnable code snippets should be prefixed by

use strict;
use warnings;
use 5.010;

Contents:

Problem statement: How to translate an expression into a C-like language

This tutorial was prompted by a question on Stack Overflow by user2516265. Here it is, slightly edited:

I am writing a script to read an Excel file translate the text into C programing syntax.

So in the Excel sheet I have string like this:

  If ((Myvalue.xyz == 1) Or (Frame_1.signal_1 == 1))
  Then a = 1 Else a = 0;

I am trying to create this output:

a = (((Myvalue.xyz == 1) || (Frame_1.signal_1 == 1)) ? 1 : 0)

How this can be handled in Perl?

Short introduction to Marpa

Various parser exists for Perl. Some of these are ancient tech, some are solid but slow, some are awesome. My favourite parsers are:

Regexp::Grammars
Provides many powerful and expressive tools. Retains the full power of Perl regexes. The downside: now everything is one huge regex.
Parser::MGC
Makes it trivial to write an object oriented top-down parser. However, the resulting code is unneccessarily verbose.
Recursive Descent
It is often best to write a RecDesc parser by hand. This affords a lot of flexibility, but can get tedious.
Marpa::R2
This is a wrapper around libmarpa. It is written in C, and uses Early parsing instead of top-down techniques. This makes it fairly easy to write arbitrary BNF, which Marpa will happily use. While it is strictly less powerful than regexes, Marpa provides excellent interfaces and good debugging help that often makes it worth the extra code.

(If you want to learn more about how the Marpa algorithm compares against other parsing algorithm, you can read my Overview of the Marpa Parser.)

Here, we will use the Scanless Interface for Marpa, which adds lexing capabilities to the BNF. To create a new grammar, we have to say:

$grammar = Marpa::R2::Scanless::G->new({ %options })

One of these options is the BNF source, which has to be passed as a reference.

Writing the BNF

We want to write a grammar that is able to correctly parse the example data

If ((Myvalue.xyz == 1) Or (Frame_1.signal_1 == 1)) Then a = 1 Else a = 0;

We see that the program is made from statements that are separated by semicolons. We can express this with the BNF rule

StatementList ::= Expression+ separator => SEMICOLON
SEMICOLON ~ ';'  # Tokens are declared with '~'

An expression can be an If/Then/Else, a parenthesized expression, a binary operator, an identifier, or a numeric literal. We should choose a sensible precedence for all these possibilities. E.g. parenthesization and literals should have very high precedence. The conditional should be lower so that it can include other expressions. For the binary operators, I have chosen the precedence order:

  1. == equality test
  2. = assignment
  3. Or logical operator

Marpa Scanless allows us to write simple alternatives with |, and || to provide prioritized rules. The Marpa || works roughly like | in regexes with respect to backtracking, i.e. the first alternative is tried “first”. The BNF for the expression would be:

Expression ::=
    ('(') Expression (')')  assoc => group
|   NUMBER
||  IDENT
||  Expression  ('==')  Expression
||  Expression  ('=')   Expression
||  Expression  ('Or')  Expression
||  Conditional

Tokens enclosed in parenthesis like ('...') are matched, but their value is discarded in the AST.

If the rule we are defining appears within an alternative, then we can specify the associativity of that grammar production. The parens (...) should use group associativity because it can contain expressions with lower precedence. Left associativity is the default, so a Or b Or c will be parsed as (a Or b) Or c.

The other referenced rules are:

Conditional ::=
    ('If') Expression ('Then') Expression
|   ('If') Expression ('Then') Expression ('Else') Expression

IDENT   ~ ident
NUMBER  ~ <number int> | <number rat>

word         ~ [\w]+  # Perl charclasses can be used for tokens
ident        ~ word | ident '.' word # allow complex identifiers
<number int> ~ [\d]+
<number rat> ~ <number int> '.' <number int>

As a naming convention, I use:

  • CamelCase names for main grammar rules (declared with ::=),
  • UPPERCASE for tokens (declared with ~ but used in top-level rules), and
  • lowercase for subtokens (declared with ~ but used in tokens or subtokens).

The Marpa lingo, those correspond to:

  • non-terminal symbols in the G1 grammar,
  • terminal symbols in the G1 grammar, which are top-level non-terminal symbols of the L0 grammar, and
  • other non-terminal symbols in the L0 grammar.

We can instruct Marpa to start matching at the “top” with the pseudorule :start ::= StatementList, and can allow the lexer to skip whitespace between tokens via

:discard ~ ws
ws ~ [\s]+

In the next section, we take a short look at Object Oriented Programming with Perl and will then circle back to see how Marpa interacts with that.

Intermission: Perl OO

An object is a thingy which we can ask to do something for us. Depending on what we are asking it to do, it will either throw an error or fulfill our request. This can simplify code because we don’t care how the object works on the inside, or where it knows from how to perform the task.

Perl uses class-based OO. We can declare a class with the package keyword. All subs in that package can then be used as methods. We can inherit methods from another package via use parent 'Parent::Class'. When a sub is used as a method, the object it was invoked on is passed as the first argument. This invocant is usually called $self in proud Smalltalk tradition. The invocant may also be a string that holds the name of the package. That is, Perl makes no distinction between class methods and instance methods.

We can create a new instance of a class by blessing a reference into that class. Any reference will do. Usually, hash references are used where each hash entry is a field of the object. But here we will focus on array references that contain a collection of data.

For demonstration purposes, here is a small example to show blessing and inheritance. The below classes allow us to calculate the statistical mean and variance of a data set.

# Call the `new` method on class `Data::Variance`:
my $object = Data::Variance->new(1.1, 1.9, 2.1);

# Call the `variance` method on the object:
say "Var: ", $object->variance;
# Call the inherited `mean` method on the object:
say "Mean:", $object->mean;

# Define the `Data::Mean` class which provides the `mean` method
BEGIN {
  package Data::Mean;

  sub new {
    # The invocant for `new` should be the class name
    my ($class, @data) = @_;
    # bless a reference to the data into the provided class
    my $self = bless \@data => $class;
    # return the freshly baked object
    return $self;
  }

  # calculate the mean
  sub mean {
    my ($self) = @_;
    my $sum = 0;
    $sum += $_ for @$self;
    return $sum / @$self;
  }
}

# Define the `Data::Variance` class which inherits from `Data::Mean`:
BEGIN {
  package Data::Variance;

  # Specify inheritance of base classes.
  # The `-norequire` option is used because the parent is in the same file.
  use parent -norequire, 'Data::Mean';

  # `new` is inherited
  # `mean` is inherited

  # calculate the variance, bootstrapping it with the `mean`
  sub variance {
      my ($self) =    @_;
      my $mean = $self->mean;
      return Data::Mean->new(map { ($_ - $mean)**2 } @$self)->mean;  # nifty math
  }
}

Output:

Var: 0.186666666666667
Mean:1.7

In that example, the reference is blessed in the new method (the constructor). This is considered good style, but we can bless any reference anywhere into any class (the class doesn’t even have to exist). This is what we’ll let Marpa do in the next step.

Marpa’s bless adverb

We can let Marpa take the array of matched values, and bless them into a class of our choosing. We can do so by augmenting a rule with the bless adverb, e.g.

StatementList ::= Expression+ separator => SEMICOLON bless => Block

But we also have to specify how these values are obtained. We can do so by setting default attributes like

:default ::= action => [values]

We could have also written an action ourselves that takes the matched values as arguments and returns some data structure. The above default action is similar to:

sub { return [@_] }

We can specify a class root for the bless adverb. If we construct the grammar with the bless_package option, e.g. bless_package => 'Ast', then the StatementList would not be blessed into the Block class but into Ast::Block.

Similar classes are added to the other rules.

In some cases of our grammar, we do not want to use the [values] action. For example, the Expression production to Conditional just delegates to another rule. Similarly, the rule for parenthesized expressions just recurses into the Expression rule. In these cases, we will use the built-in action ::first:

Expression ::=
      ('(') Expression (')')  assoc => group action => ::first
  ...
  || Conditional  action => ::first

Obtaining a parse

This is the general workflow to obtain a parse from a Marpa::R2::Scannless::G Grammar object:

  1. Create a recognizer for that grammar:

    my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
  2. Read the input string:

    $recce->read(\$input);
  3. Look at the value of the parse. The parse failed if it is undef. Otherwise, it is a reference to our AST:

    my $value = $recce->value;
    defined $value or die "Parse failed";
    my $ast = $$value;

The downloadable module Parser.pm implements the above grammar. The dump-ast.pl script uses that module to parse the example input, and prints out the resulting data structure.

Here is the output of the AST using Data::Dump:

bless([
  bless([
    bless([
      bless([
        bless(["Myvalue.xyz"], "Ast::Var"),
        bless([1], "Ast::Literal"),
      ], "Ast::Equals"),
      bless([
        bless(["Frame_1.signal_1"], "Ast::Var"),
        bless([1], "Ast::Literal"),
      ], "Ast::Equals"),
    ], "Ast::Or"),
    bless([bless(["a"], "Ast::Var"), bless([1], "Ast::Literal")], "Ast::Assign"),
    bless([bless(["a"], "Ast::Var"), bless([0], "Ast::Literal")], "Ast::Assign"),
  ], "Ast::Cond"),
], "Ast::Block")

Here is the output using Data::TreeDumper:

ast blessed in 'Ast::Block'
`- 0 =  blessed in 'Ast::Cond' 
   |- 0 =  blessed in 'Ast::Or' 
   |  |- 0 =  blessed in 'Ast::Equals' 
   |  |  |- 0 =  blessed in 'Ast::Var' 
   |  |  |  `- 0 = 'Myvalue.xyz' 
   |  |  `- 1 =  blessed in 'Ast::Literal' 
   |  |     `- 0 = '1' 
   |  `- 1 =  blessed in 'Ast::Equals' 
   |     |- 0 =  blessed in 'Ast::Var' 
   |     |  `- 0 = 'Frame_1.signal_1' 
   |     `- 1 =  blessed in 'Ast::Literal' 
   |        `- 0 = '1' 
   |- 1 =  blessed in 'Ast::Assign' 
   |  |- 0 =  blessed in 'Ast::Var' 
   |  |  `- 0 = 'a' 
   |  `- 1 =  blessed in 'Ast::Literal' 
   |     `- 0 = '1' 
   `- 2 =  blessed in 'Ast::Assign' 
      |- 0 =  blessed in 'Ast::Var' 
      |  `- 0 = 'a' 
      `- 1 =  blessed in 'Ast::Literal' 
         `- 0 = '0' 

And here is the AST illustrated as a sideways tree (written manually with Unicode box drawing characters):

                            ┌╴Var╶╴'Myvalue.xyz'
                    ┌╴Binop╶┼╴'=='
                    │       └╴Literal╶╴'1'
            ┌╴Binop╶┼╴'Or'
            │       │       ┌╴Var╶╴'Frame_1.signal_1'
            │       └╴Binop╶┼╴'=='
            │               └╴Literal╶╴'1'
            │
            │       ┌╴Var╶╴'a'
Block╶╴Cond╶┼╴Binop╶┼╴'='
            │       └╴Literal╶╴'1'
            │
            │       ┌╴Var╶╴'a'
            └╴Binop╶┼╴'='
                    └╴Literal╶╴'0'

We can see that the Data::Dump output is much harder to read than the tree. This is mainly because it contains too much superfluous information, and presents the information in a weird order. In a moment, we will try to create our own output for AST debugging.

Working with the AST

Before we can start working with the syntax tree Marpa creates for us, we should take a moment to create our AST classes, and write a few accessor method so that we don’t have to hardcode indices of the underlying arrayrefs. This is fairly boring:

package Ast;  # a parent class to define common methods

# a constructor for convenience.
# Marpa won't use this but rather `bless` directly
sub new {
    my ($class, @args) = @_;
    bless \@args => $class;
}

package Ast::Binop;  # convenience base for all binary operators
use parent -norequire, 'Ast';

sub l   { shift()->[0] }
sub r   { shift()->[1] }

package Ast::Equals;
use parent -norequire, 'Ast::Binop';

package Ast::Assign;
use parent -norequire, 'Ast::Binop';

package Ast::Or;
use parent -norequire, 'Ast::Binop';

package Ast::Var;
use parent -norequire, 'Ast';

sub name { shift()->[0] }

package Ast::Cond;
use parent -norequire, 'Ast';

sub cond { shift()->[0] }
sub then { shift()->[1] }
sub else { shift()->[2] }

package Ast::Block;
use parent -norequire, 'Ast';

package Ast::Literal;
use parent -norequire, 'Ast';

sub val { shift()->[0] }

We will now look at a few ways to transform the AST.

Transformation: cloning

It can sometimes be useful to make a deep copy of a data structure. This is fairly easy for each AST node. To copy an AST node, it

  1. clones all child nodes, or copies the values if they aren’t a regular node (we call such nodes a terminal node).
  2. blesses an array reference containing those copies into the correct class.

To determine whether a given child node is a terminal node, we ask it if it can perform the clone method. We can do so with the universal can method. I.e. $node->can('clone') does this check.

However, it is a fatal error to call a method on an unblessed reference (something that isn’t an object). Therefore, we first do a check whether the $node is an object at all. The blessed method from Scalar::Util comes in handy here.

Therefore, we can perform cloning like:

package Ast;
use Scalar::Util qw/blessed/;
...

sub clone {
    my ($self) = @_;
    my @childs = map { _clone_if_possible($_) } @$self;
    return ref($self)->new(@childs);
    # ref $self is the class of $self.
}

sub _clone_if_possible {
    my ($maybe_ast) = @_;
    if (blessed($maybe_ast) and $maybe_ast->can('clone')) {
        return $maybe_ast->clone;
    }
    return $maybe_ast;
}

Note: This technique assumes that the given data structure is a tree alright, and doesn’t contain loops (that would be a cyclic graph, not a tree).

I have prepared a script that clones an AST tree.

Transformation: AST dump (aka. compilation)

As mentioned above, the Data::Dump output is hard to read. Therefore, we will create a method that pretty-prints a part of the AST. The output will generally look like:

Ast::Something(
  childnode_A(
    ... )
  'some literal'
  ${a.variable} )

Our prettyprint() method will take an optional argument that indicates the indentation level. First, we add the general implementation:

package Ast;
...
sub prettyprint {
    my ($self, $indent) = @_;
    $indent //= 0;  # initialize $indent if no value passed
    $indent++;      # increment indent level
    my $items = join "\n",         # concatenate items with newline
        map { "  "x$indent . $_ }  # pad with intendation
        map { _prettyprint_if_possible($_, $indent) }
        @$self;
    my $type = ref $self;
    return "$type(\n$items )";
}

sub _prettyprint_if_possible {
    my ($maybe_ast, $indent) = @_;
    if (blessed($maybe_ast) and $maybe_ast->can('prettyprint')) {
        return $maybe_ast->prettyprint($indent);
    }
    return $maybe_ast;
}

For Ast::Literal and Ast::Var, we provide special implementations:

package Ast::Literal;
...
sub prettyprint {
    my ($self) = @_;
    return q(") . $self->val . q(");
}

package Ast::Var;
...
sub prettyprint {
    my ($self) = @_;
    return '${' . $self->name . '}';
}

I have made a script that adds prettyprinting. When the prettyprint method is called on the dumped AST, we get the output:

Ast::Block(
  Ast::Cond(
    Ast::Or(
      Ast::Equals(
        ${Myvalue.xyz}
        "1" )
      Ast::Equals(
        ${Frame_1.signal_1}
        "1" ) )
    Ast::Assign(
      ${a}
      "1" )
    Ast::Assign(
      ${a}
      "0" ) ) )

But what does that have to do with compilation?

What I have done here was defining an output language, and then compiling the AST to that language. I can easily add further implementations that change how the AST is compiled. E.g, we could add an if-then-else for the conditional like

package Ast::Cond;
...
sub prettyprint {
    my ($self, $indent) = @_;
    $indent //= 0;
    my ($cond, $then, $else) =
        map { $_->prettyprint($indent) } @$self;
    return "if $cond\n"
      . "  "x$indent . "then $then\n"
      . "  "x$indent . "else $else";
}

This would produce the output

Ast::Block(
  if Ast::Or(
    Ast::Equals(
      ${Myvalue.xyz}
      "1" )
    Ast::Equals(
      ${Frame_1.signal_1}
      "1" ) )
  then Ast::Assign(
    ${a}
    "1" )
  else Ast::Assign(
    ${a}
    "0" ) )

The output format need not be a pretty-printed output for debugging, but it can also be another target language, such as C. Because the details depend on what kind of input you are dealing with and other constraints, I’m leaving that as an exercise for the reader. But the principle is the same: you would define a method that performs the correct transformation of your AST nodes, and then you transform your tree recursively, from the bottom up.

Here, the conditional might be translated to a C ternary conditional like this:

package Ast::Cond;
...
sub compile_c_expression {
    my ($self) = @_;
    my ($cond, $then, $else) =
        map { $_->compile_c_expression } @$self;
    return "($cond ? $then : $else)";
}

So the method looks very similar. In fact, this C transpiler looks even simpler because it doesn’t track the correct indentation level.

Transformation: AST modification (aka. optimization)

Arguably more difficult than flattening the AST to some representation is modifying the structure of the syntax tree. This is due mostly to Perl’s type system which doesn’t lend itself to these tasks. (Generally, strongly typed functional languages like Ocaml or Haskell tend to excel here due to very expressive pattern matching.)

In the original Stack Overflow question, a part of the problem was factoring the assignment from a common variable out of the conditional branches.

We have to match this schema:

(cond) ? a = 1 : a = 0

And turn it into this format:

a = (cond) ? 1 : 0

To do that, we have to assert:

  1. That both branches are assignments. In our model, that means they are Ast::Assign instances.
  2. That the lvalue of both assignments is a variable (Ast::Var instance).
  3. That this variable has the same name.

Then we can factor out the assignment, and will assign the result of the conditional. Expressed as Perl code, given a branch $branch, we want this condition to be true:

$branch->isa("Ast::Assign") && $branch->l->isa("Ast::Var")

After that has been checked, the last condition can be expressed as $then->l->name eq $else->l->name. The complete simplify method for Ast::Cond would then be:

package Ast::Cond;
...
sub simplify {
    my $self = shift;
    my ($cond, $then, $else) =
        map { $_->simplify } @$self;

    if (    $then->isa('Ast::Assign') and $then->l->isa('Ast::Var')
        and $else->isa('Ast::Assign') and $else->l->isa('Ast::Var')
        and $then->l->name eq $else->l->name
    ) {
        return Ast::Assign->new(
            $then->l,
            Ast::Cond->new($cond, $then->r, $else->r),
        );
    }

    # else: just do what would have been done by default
    return $self->SUPER::simplify;
}

All other nodes inherit a simplify from Ast that does

sub simplify {
    my $self = shift;
    my @childs =
      map { blessed($_) && $_->can('simplify') ? $_->simplify : $_ }
      @$self;
    return ref($self)->new(@childs);
}

In other words, a simplify() on an AST that does not contain a conditional where the common subexpression is found has the same result as a clone on the AST.

I implemented this in another script. If the simplify() method is invoked on our AST, we get the following output:

Ast::Block(
  Ast::Assign(
    ${a}
    if Ast::Or(
      Ast::Equals(
        ${Myvalue.xyz}
        "1" )
      Ast::Equals(
        ${Frame_1.signal_1}
        "1" ) )
    then "1"
    else "0" ) )

Closing remarks

Creating an AST and compiling it down to some representation may seem somewhat hard, but it isn’t rocket science either.

In the above examples, simplicity and code size were a constraint. In a real-world application, the grammar would usually be more extensive. Here, the grammar was tailored to parse the example input, but not much more. For more complicated grammars the AST Marpa gives us and the AST we want might be slightly different. That can be fixed with custom actions.

And the AST classes would be more feature-rich. In particular, the abstract classes like Ast or Ast::Binop would refuse to be instantiated directly.

There are also some fine points when compiling to a C-like language that are not discussed in this post.

  • The AST is unambiguous and has no concept of precedence. When emitting output code, it may be helpful to surround every expression with parentheses even when a human programmer knows they are unnecessary.

  • Here we assume the conditional only contains expressions, but a conditional might also contain statements. In C, statements and expressions are very distinct, and (aside from a GCC extension) it is not possible to put statements inside expressions, like do {...} allows in Perl.

  • It is also sometimes necessary to extract an expression as a separate statement. That usually requires a variable name to be generated.

  • These generated symbols must not clash with other variables, so now we need some function to generate unique variable names.

  • If our C code uses pointers, a direct translation from the source language may be unsafe because it may assume a garbage-collected runtime.

The semantic mismatch between two languages is usually much deeper than syntactic differences. Parsing and AST manipulations are comparatively simple, especially with great tools like Marpa. That is why transforming syntax is the easy part of a compiler. The difficult part is translating semantics from the input language to the target language.