Transforming Syntax

Or: how to write the easy part of a compiler

A Stack Overflow question asked how to translate a VB-like conditional into a C-like ternary. The other answers suggested regexes or treating it as Perl code *shudder*. But transpiling code to another language can be done correctly.

This post aims to cover:

• parsing with Marpa::R2,
• AST manipulation,
• optimization passes,
• compilation, and
• Perl OO.

In the end, we’ll be able to do all that in only 200 lines of code!

Since this post is already rather long, we will not discuss parsing theory. You are expected to be familiar with EBNF grammar notation.

Note: All runnable code snippets should be prefixed by

Contents:

Problem statement: How to translate an expression into a C-like language

This tutorial was prompted by a question on Stack Overflow by user2516265. Here it is, slightly edited:

I am writing a script to read an Excel file translate the text into C programing syntax.

So in the Excel sheet I have string like this:

If ((Myvalue.xyz == 1) Or (Frame_1.signal_1 == 1))
Then a = 1 Else a = 0;

I am trying to create this output:

a = (((Myvalue.xyz == 1) || (Frame_1.signal_1 == 1)) ? 1 : 0)

How this can be handled in Perl?

Short introduction to Marpa

Various parser exists for Perl. Some of these are ancient tech, some are solid but slow, some are awesome. My favourite parsers are:

Regexp::Grammars
Provides many powerful and expressive tools. Retains the full power of Perl regexes. The downside: now everything is one huge regex.
Parser::MGC
Makes it trivial to write an object oriented top-down parser. However, the resulting code is unneccessarily verbose.
Recursive Descent
It is often best to write a RecDesc parser by hand. This affords a lot of flexibility, but can get tedious.
Marpa::R2
This is a wrapper around libmarpa. It is written in C, and uses Early parsing instead of top-down techniques. This makes it fairly easy to write arbitrary BNF, which Marpa will happily use. While it is strictly less powerful than regexes, Marpa provides excellent interfaces and good debugging help that often makes it worth the extra code.

(If you want to learn more about how the Marpa algorithm compares against other parsing algorithm, you can read my Overview of the Marpa Parser.)

Here, we will use the Scanless Interface for Marpa, which adds lexing capabilities to the BNF. To create a new grammar, we have to say:

One of these options is the BNF source, which has to be passed as a reference.

Writing the BNF

We want to write a grammar that is able to correctly parse the example data

We see that the program is made from statements that are separated by semicolons. We can express this with the BNF rule

StatementList ::= Expression+ separator => SEMICOLON
SEMICOLON ~ ';'  # Tokens are declared with '~'

An expression can be an If/Then/Else, a parenthesized expression, a binary operator, an identifier, or a numeric literal. We should choose a sensible precedence for all these possibilities. E.g. parenthesization and literals should have very high precedence. The conditional should be lower so that it can include other expressions. For the binary operators, I have chosen the precedence order:

1. == equality test
2. = assignment
3. Or logical operator

Marpa Scanless allows us to write simple alternatives with |, and || to provide prioritized rules. The Marpa || works roughly like | in regexes with respect to backtracking, i.e. the first alternative is tried “first”. The BNF for the expression would be:

Expression ::=
('(') Expression (')')  assoc => group
|   NUMBER
||  IDENT
||  Expression  ('==')  Expression
||  Expression  ('=')   Expression
||  Expression  ('Or')  Expression
||  Conditional

Tokens enclosed in parenthesis like ('...') are matched, but their value is discarded in the AST.

If the rule we are defining appears within an alternative, then we can specify the associativity of that grammar production. The parens (...) should use group associativity because it can contain expressions with lower precedence. Left associativity is the default, so a Or b Or c will be parsed as (a Or b) Or c.

The other referenced rules are:

Conditional ::=
('If') Expression ('Then') Expression
|   ('If') Expression ('Then') Expression ('Else') Expression

IDENT   ~ ident
NUMBER  ~ <number int> | <number rat>

word         ~ [\w]+  # Perl charclasses can be used for tokens
ident        ~ word | ident '.' word # allow complex identifiers
<number int> ~ [\d]+
<number rat> ~ <number int> '.' <number int>

As a naming convention, I use:

• CamelCase names for main grammar rules (declared with ::=),
• UPPERCASE for tokens (declared with ~ but used in top-level rules), and
• lowercase for subtokens (declared with ~ but used in tokens or subtokens).

The Marpa lingo, those correspond to:

• non-terminal symbols in the G1 grammar,
• terminal symbols in the G1 grammar, which are top-level non-terminal symbols of the L0 grammar, and
• other non-terminal symbols in the L0 grammar.

We can instruct Marpa to start matching at the “top” with the pseudorule :start ::= StatementList, and can allow the lexer to skip whitespace between tokens via

ws ~ [\s]+

In the next section, we take a short look at Object Oriented Programming with Perl and will then circle back to see how Marpa interacts with that.

Intermission: Perl OO

An object is a thingy which we can ask to do something for us. Depending on what we are asking it to do, it will either throw an error or fulfill our request. This can simplify code because we don’t care how the object works on the inside, or where it knows from how to perform the task.

However, it is a fatal error to call a method on an unblessed reference (something that isn’t an object). Therefore, we first do a check whether the $node is an object at all. The blessed method from Scalar::Util comes in handy here. Therefore, we can perform cloning like: Note: This technique assumes that the given data structure is a tree alright, and doesn’t contain loops (that would be a cyclic graph, not a tree). I have prepared a script that clones an AST tree. Transformation: AST dump (aka. compilation) As mentioned above, the Data::Dump output is hard to read. Therefore, we will create a method that pretty-prints a part of the AST. The output will generally look like: Ast::Something( childnode_A( ... ) 'some literal'${a.variable} )

Our prettyprint() method will take an optional argument that indicates the indentation level. First, we add the general implementation:

For Ast::Literal and Ast::Var, we provide special implementations:

I have made a script that adds prettyprinting. When the prettyprint method is called on the dumped AST, we get the output:

Ast::Block(
Ast::Cond(
Ast::Or(
Ast::Equals(
${Myvalue.xyz} "1" ) Ast::Equals(${Frame_1.signal_1}
"1" ) )
Ast::Assign(
${a} "1" ) Ast::Assign(${a}
"0" ) ) )

But what does that have to do with compilation?

What I have done here was defining an output language, and then compiling the AST to that language. I can easily add further implementations that change how the AST is compiled. E.g, we could add an if-then-else for the conditional like

This would produce the output

Ast::Block(
if Ast::Or(
Ast::Equals(
${Myvalue.xyz} "1" ) Ast::Equals(${Frame_1.signal_1}
"1" ) )
then Ast::Assign(
${a} "1" ) else Ast::Assign(${a}
"0" ) )

The output format need not be a pretty-printed output for debugging, but it can also be another target language, such as C. Because the details depend on what kind of input you are dealing with and other constraints, I’m leaving that as an exercise for the reader. But the principle is the same: you would define a method that performs the correct transformation of your AST nodes, and then you transform your tree recursively, from the bottom up.

Here, the conditional might be translated to a C ternary conditional like this:

So the method looks very similar. In fact, this C transpiler looks even simpler because it doesn’t track the correct indentation level.

Transformation: AST modification (aka. optimization)

Arguably more difficult than flattening the AST to some representation is modifying the structure of the syntax tree. This is due mostly to Perl’s type system which doesn’t lend itself to these tasks. (Generally, strongly typed functional languages like Ocaml or Haskell tend to excel here due to very expressive pattern matching.)

In the original Stack Overflow question, a part of the problem was factoring the assignment from a common variable out of the conditional branches.

We have to match this schema:

(cond) ? a = 1 : a = 0

And turn it into this format:

a = (cond) ? 1 : 0

To do that, we have to assert:

1. That both branches are assignments. In our model, that means they are Ast::Assign instances.
2. That the lvalue of both assignments is a variable (Ast::Var instance).
3. That this variable has the same name.

Then we can factor out the assignment, and will assign the result of the conditional. Expressed as Perl code, given a branch $branch, we want this condition to be true: After that has been checked, the last condition can be expressed as$then->l->name eq $else->l->name. The complete simplify method for Ast::Cond would then be: All other nodes inherit a simplify from Ast that does In other words, a simplify() on an AST that does not contain a conditional where the common subexpression is found has the same result as a clone on the AST. I implemented this in another script. If the simplify() method is invoked on our AST, we get the following output: Ast::Block( Ast::Assign(${a}
if Ast::Or(
Ast::Equals(
${Myvalue.xyz} "1" ) Ast::Equals(${Frame_1.signal_1}
"1" ) )
then "1"
else "0" ) )

Closing remarks

Creating an AST and compiling it down to some representation may seem somewhat hard, but it isn’t rocket science either.

In the above examples, simplicity and code size were a constraint. In a real-world application, the grammar would usually be more extensive. Here, the grammar was tailored to parse the example input, but not much more. For more complicated grammars the AST Marpa gives us and the AST we want might be slightly different. That can be fixed with custom actions.

And the AST classes would be more feature-rich. In particular, the abstract classes like Ast or Ast::Binop would refuse to be instantiated directly.

There are also some fine points when compiling to a C-like language that are not discussed in this post.

• The AST is unambiguous and has no concept of precedence. When emitting output code, it may be helpful to surround every expression with parentheses even when a human programmer knows they are unnecessary.

• Here we assume the conditional only contains expressions, but a conditional might also contain statements. In C, statements and expressions are very distinct, and (aside from a GCC extension) it is not possible to put statements inside expressions, like do {...} allows in Perl.

• It is also sometimes necessary to extract an expression as a separate statement. That usually requires a variable name to be generated.

• These generated symbols must not clash with other variables, so now we need some function to generate unique variable names.

• If our C code uses pointers, a direct translation from the source language may be unsafe because it may assume a garbage-collected runtime.

The semantic mismatch between two languages is usually much deeper than syntactic differences. Parsing and AST manipulations are comparatively simple, especially with great tools like Marpa. That is why transforming syntax is the easy part of a compiler. The difficult part is translating semantics from the input language to the target language.