Parsing pervades all of computing. Whenever we are dealing with some data format or a programming language, something needs to parse that. But writing a good grammar for a given language, and picking a suitable parser is not always easy.

For example, let's look at some strings in an example language:

   *
  *=*
*=*=*=*

We could describe this in natural language as:

  • There can be a sole star.
  • Or there can be stripes with stars on either side.
  • Or there can be stripes with stars and stripes on either side

This description might prompt us to express this with the following grammar:

StarsAndStripes
  ::= '*'
  |   StarsAndStripes '=' StarsAndStripes

That is correct BNF, and we can see that we can create the string *=*=*=* through the derivation (*=*)=(*=*). However, this grammar is wildly ambiguous! There are five possible parses for that string in this grammar:

  * = (*  = (*  = *))
  * =((*  =  *) = *)
 (* =  *) = (*  = *)
 (* = (*  =  *))= *
((* =  *) =  *) = *

So while this grammar is correct, it is unpractical. Of course, we can note that although StarsAndStripes is recursive, the recursion must terminate with the '*' case. We can therefore substitute either the left or right reference to StarsAndStripes with '*' at no loss of correctness:

StarsAndStripes
  ::= '*'
  |   '*' '=' StarsAndStripes

These two grammars recognize the same language, but this latter grammar is unambiguous. However, it no longer matches our natural-language description. In fact, it can't generate the derivation (*=*)=(*=*) but only the derivation *=(*=(*=*)).

In this instance, this grammar rewrite is safe and sensible, but in other cases rewriting the grammar is much more difficult since it potentially changes the whole structure of the resulting parse tree.

There now are various classes of parsers. We are only interested in parsers for context-free grammars (CFGs): regular expressions are not generally powerful enough to be interesting, and more powerful kinds of grammars (unrestricted grammars, context-sensitive grammars) are too complex to be parsed efficiently.

  1. Some parsers use a top-down approach; their grammars can generally be classified as LL(k)LL(k).
  2. Others use a bottom-up method. These are LR(k)LR(k) grammars and the important class of LALRLALR grammars.
  3. Earley's algorithm and its offspring Marpa include both top-down and bottom-up elements, and can parse any CFG.

Top-Down: LL(k), Recursive Descent, PEG

Top-down parsers are fairly intuitive: given a production P ::= a b c, to parse P we first have to parse a, then b, then c. Then the we are done and return the result. This algorithm is so simple that it's usually implemented manually with a procedural “Recursive Descent” approach. Due to its flexibility, this is the most widely-used algorithms in hand-coded parsers, and it is easy to embed custom logic (e.g. symbol table handling) into the parser.

However, it's not all sunshine.

  • Naive implementations struggle with recursion, e.g. in P ::= P b c | a. That can easily throw them into an infinite loop. Of course, the grammar can be rewritten to avoid this.

  • A top-down approach struggles with ambiguity, where we do not know which production of multiple possible choices we should use. For example, the grammar P ::= a | b can be resolved by only looking at the next token (LL(1)LL(1)) to see which alternative should be chosen. But to resolve the grammar P ::= a b | a c we need a lookahead of two tokens (LL(2)LL(2)). If the alternatives begin with a reference to another production instead of a simple token, things get difficult. The LL()LL(*) class can handle all grammars where the lookahead can be described by a regular language.

    If none of these approaches work, we can give up. That's surprisingly sane and guarantees that if the parser parses something, it will be fast.

    Or we can take a guess and see if it works out. The problem is that if something doesn't work out, we need to backtrack our steps to the last choice point and continue with the next best alternative. This works, but can be terribly slow: the parser can go exponential.

Parsing Expression Grammars (PEGs) are essentially a formalisation of recursive descent parsers. Whereas in CFGs different alternatives are not prioritized, PEGs only have an ordered choice. This allows them to be simple and efficient, but they make it more difficult to reason about the language which is actually parsed by a grammar. E.g. the parsing expression (a/aa)a recognizes aa but not aaa, while (aa/a)a recognizes aaa but not aa: in both cases the first alternative matches and consumes part of the input. The second alternative can never be considered.

Scorecard:

complexity : O(n)O(n) typical, O(exp n)O(\mathrm{exp}\ n) worst case with backtracking

ambiguity : commits to one alternative, optional backtracking

error reporting : aware of the current rule – but typically only gives you the first or last error, not all errors at this position

state of the art : PEG/Packrat

notable implementations : ANTLR, Parsec, OMeta, PEG.js

Bottom-Up: LR(k), LALR

Bottom-up parsing techniques look at the stream of input tokens and try to recognize patterns. When a known pattern matches, the tokens are combined and substituted by an AST fragment. This AST fragment may in turn be a part of some pattern. This is very much like top-down parsing, except backwards.

LR parsers solve many problems of LL parsers. For example, LR parser can handle a broader class of grammars. Recursive rules aren't a problem any more since we will necessarily match the pattern for the base case before we can match the pattern of the recursion.

For all that LR seems convenient, it still has a number of theoretical and practical problems.

  • Ambiguity absolutely cannot be handled, and there's no workaround such as backtracking. When an LR parser finds that two rules can match the same input, it cannot continue. Generalized LR parsers (GLR) manage to parse all CFGs including ambiguous grammars by splitting their stack of tokens, but their worst-case complexity rises to O(n3)O(n^3), and all other adverse properties of LR parsing remain.

  • LR parsers are not aware of their context. They must decide what to do based on the input at hand. This is a problem if a grammar has two rules that would match the same token sequence, but is not ambiguous because the grammar guarantees these cannot collide. This is relevant when we want to compose grammars.

  • LR parsers give shitty error messages. They are aware of what tokens are acceptable in the current position, but they don't know yet what we're trying to parse. An LR parser will happily tell you that you need to provide a plus or a minus, but it doesn't know you were trying to write a subtraction or addition.

  • LR parsers are hell to debug. No wonder, since they don't know what they are doing until they've done it.

  • Rewriting grammars to fit the restrictions of LALR is annoying.

  • LR parsers require complicated state transition tables. Thankfully, there are tools that will compile these tables for you. As a result, no one writes LR parsers by hand. This is both more convenient, but also makes it more difficult to add custom behaviour or little tweaks to the parsing process.

The LALR subset represents the most important class of CFGs, since most programming languages can be (and are) parsed by an LALR parser.

Scorecard:

complexity : O(n)O(n), with O(n3)O(n^3) worst case for GLR

ambiguity : cannot handle ambiguities, with the exception of GLR which can parse anything.

error reporting : only aware of expected tokens, not aware of current rule

state of the art : LALR, GLR

notable implementations : Bison, YACC

Earley and Marpa

The Earley parser combines the top-down and bottom-up approach. The parser maintains a set of possible rules at the current parse location. This set contains all alternatives; therefore ambiguities don't present a problem. When we read a token, we discard all alternatives that weren't matched, and advance the position in those rules that were matched. From there, we can again predict all possibilities at that position, and continue with the next token. Only when a rule was successfully completed we do apply that rule. So while prediction is essentially top-down, rule application is similar to bottom-up.

This combination turns out to be extremely powerful: The parser always knows what rules we're currently in, and also what tokens are expected. That's the best of LL and LR parsing combined. In fact, it's even better since we can parse all CFGs, not just those that can be expressed as LL or LR grammars.

Except that Earley's algorithm was a bit impractical for its time when it was first formulated. Compared to the simple state transitions of an LR parser, an Earley parser involves much more expensive operations. Also, the original algorithm contained a bug, and didn't have a very attractive algorithmic complexity for important categories of grammars. That made Earley very interesting for areas such as natural language processing where support for complicated grammars is more important than efficiency, but not for programming language implementation, where limited grammars are still useful but efficiency is paramount.

All of that changed when in 1991 Joop Leo figured out how to make the Earley algorithm run in linear time for every LR(k)LR(k) grammar. And when Aycock and Horspool fixed the remaining bug in 2002. Together, these improvements make the Earley parser (1.) as efficient as any LR(k)LR(k) parser or parsers for LRLR subsets, and (2.) just as powerful as GLR – both can parse arbitrary CFGs.

One the one hand, this is a lot of power, but this power also comes with a lot of restrictions. Because a recursive-descent parser is so extremely simple, it's also easy to extend with new features or custom logic. This is not generally the case with Earley parsers. Also, since Earley and GLR parsers return all valid parses, some assumptions that are valid in LL parsers do not hold. For example, we cannot assume that a predicted production will be completed, or that this is the only active alternative. This means that any plugged-in code during the parse should really be free of side effects. However, many advanced techniques such as maintaining a symbol table during the parse, or non-context-free language features such as indendation sensitive languages or here-docs do require mutation of parser state.

Scorecard:

complexity : O(n)O(n) for LR(k)LR(k) grammars, O(n2)O(n^2) for remaining unambiguous grammars, O(n3)O(n^3) worst case for arbitrary CFGs

ambiguity : can handle any local or global ambiguity

error reporting : total awareness: knows all possible current rules and all acceptable tokens

state of the art : improvements by Leo, Aycock, Horspool; synthesis in Marpa algorithm

notable implementations : Marpa

Marpa advantages

The Marpa algorithm combines the Earley parser with all known improvements, and its reference implementation serves as an interesting testbed for advanced parsing techniques. Some abilities are shared by other parsers, but the overall mix is intruiging.

Manual Lexing

With top-down parsing, it's easy to take over parsing manually - in fact, it's what you'll be doing anyway with recursive descent. This means that it's trivially possible to nest languages. With Marpa, we can pause the built-in scanner, and switch to an arbitrary parser or lexer of our own. This may even be another Marpa parser! The real power comes from the ability to read a sequence of tokens into the Marpa recognizer instead of just a single token representing the sub-parse. This is fundamentally possible with other parser technologies as well, but Marpa does it really well.

Ruby Slippers Parsing

Real-life grammars can get insanely complicated, especially when we consider oddities such as optional tags in HTML, or automatic semicolon insertion in some languages. In the case of HTML, this can be done with a vastly more complicated parser that encodes both the case with a tag present and with a tag elided. Automatic semicolon insertion is usually handled on a tokenizer level, where we insert a semicolon depending on the surrounding tokens (e.g. at the end of line unless it ended in a comma or backslash).

With Marpa, we only need to encode that “happy case” in our grammar. The parser will then fail, but we can handle that case. Remember that Earley parsers know very precisely what input would be needed at this location for a successful parse. The error handler can therefore check whether this context is eligible for a fix-up, and provide the missing token or tokens. The normal parse can then be resumed afterwards.

Marpa's author Jeffrey Kegler calls this “Ruby Slippers Parsing”. In the Wizard of Oz, the character Dorothy dearly wishes to return home from her adventure, but is unable to do so. After she taps the heels of her ruby slippers together, she is magically returned home. This is analogous to a Marpa parser: when the input does not permit a successful parse, we can fix up the input and set the parser off at a valid state again.

Further reading: Jeffrey's original post about Ruby Slippers.

Longest Acceptable Token Matching

Usually, tokenizers do Longest Token Matching (LTM) – when the input at the current position could be read as various different tokens, the longest of these possible tokens would be selected. Most tokenizers can also handle priorities between different tokens to break ties (e.g. between identifiers and keywords), and Marpa's Scanless Interface (SLIF) could also accept different tokens of same length at the same position.

Unfortunately, this runs into problems when considering sub-languages that have different tokens. An example grammar is an INI style file where each line has a key and a value:

key: value

The key is anything up to the first :, the rest of the line is an associated value. We might be tempted to use a grammar such as

Assoc ::= key separator value newline
key ~ [^:]+
value ~ [^\n]+
separator ~ ': '
newline ~ [\n]

But LTM makes that blow up: at the beginning of the line, we try to match each token (key, value, separator, newline). Obviously, value is the longest token since it matches the whole line, not key which is the only token that would be accepted by the grammar.

You can read this Stack Overflow question for a more detailed discussion of the problem.

Longest Acceptable Token Matching fixes this. Marpa knows beforehand which tokens are possible at this position, so it asks the tokenizer to only consider these acceptable tokens. For this reduced set, we can then do traditional longest-token matching.

In principle, this could be done with any LR-based parser, but I'm only aware of the implementation in Marpa's scanless interface.

Grammar Debugging

Earley parsers necessarily keep track of a lot of information that LL or LR parsers don't. When debugging a grammar, we can inspect this information. Marpa has various report formats, although reading these requires a bit of experience. Notably, Marpa can report on all alternatives that are valid at this position, and can tell us which tokens and rules are predicted for the next position. This is indispensable when a grammar features an unexpected local ambiguity. This will cause multiple valid parses to be available, rather than just the one parse we expected. Using Marpa's progress report, we can figure out where the ambiguity was introduced.

As an aside, it would be interesting to see a tool that hooks into a Marpa parser and animates the progress and the change of prediction sets, which would make Marpa debugging far more user-friendly.

Take Grammars Directly From The Specification

Many technical standards or specifications for languages, protocols, or data formats include some kind of grammar. Often, this grammar is only intended for implementing programmers and is not practically parsable by commonly used tools. However, Marpa can parse any BNF grammar with only syntactic changes. Unlike with LL or LR parsers, the whole structure of the grammar does not have to be changed.

This has two big advantages: Marpa's universality makes it well suited for reference implementations. These reference implementations can also serve as a testbed to find any ambiguities still lurking inside the syntax of a proposed standard. Secondly, using Marpa for an implementation of a standard makes it less likely that the act of implementing the grammar introduces bugs. This would be a very real possibility when implementing a BNF-specified grammar with a PEG parser since that changes the grammar's interpretation, or when rewriting a grammar to fit inside the limitations of a specific technology such as LALR.

Efficient Handling of Highly Ambiguous Parses

Marpa is one of very few practical parsers that embrace ambiguity. The problem with a highly ambiguous parse is that there are many valid interpretations, often too many to practically iterate.

In some cases, Marpa's Abstract Syntax Forests can help, as they expose all choice points to the programmer. This is especially interesting in the context of natural language processing where interpretation choices are made based on heuristics and probabilities.

Marpa weaknesses

Of course, the Marpa and Earley algorithms don't only have strengths. There's also a nice list of points that might turn you away from Marpa for some use cases:

No support for streaming parsing

For very large documents, it is impractical to build a complete parse tree. This is commonly the case with XML databases (which are a WTF in themselves). Marpa does not currently support a streaming mode of operation, but please note that such a mode is incompatible with global ambiguities. However, Marpa can be configured to emit various events during the recognizer phase that could be used to do something like streaming parsing.

High Memory Use

Earley parsing involves a lot of space for the precomputed tables and the runtime Earley sets. When we parse on a character-by-character basis, megabyte-size documents start to become impractical. Therefore, Marpa's Scanless Interface (SLIF) uses a separate parser for character-by-character scanning, which cooperates with the main token-level grammar. This split leads to some minor impedance if one is not aware of the specifics.

However, this might possibly change for low-ambiguity parses: While Marpa currently first has a recognizer phase and then assembles the parse tree(s), these phases could be intermixed, similar to the modus operandi of an LR parser.

Lacking Conceptual Simplicity

The Earley algorithm is comparatively difficult to understand, and the enhancements implemented by Marpa don't make it simpler. This complexity makes it difficult to reason about Marpa. I currently treat Marpa as a very opaque black box that seems to do what I ask of it. I believe conceptual simplicity is one of the reasons why PEGs are currently fairly popular even though they have no technical advantages.

Difficult to Extend

There are two ways to extend a Marpa grammar: You can either hack new functionality into the library itself (as it would be needed for some features such as layout parsing), or you can use the event system to perform side effects, context-sensitive tests, or external scanning. Neither way is straightforward. In particular, the event system is very inconvenient. A conceptually simple thing such as temporarily switching to manual lexing or maintaining an external stack is very difficult to pull of correctly, unless you know your grammar is free from local ambiguity in the relevant section.

I have great hopes that the upcoming Kollos interface will bring clarity here, which will enable easier experimentation. Right now, the interface is calibrated for maximum flexibility; but most of that flexibility is ballast in the majority of simple or intermediate use cases.

Conclusion

Marpa is seriously awesome: it combines the advantages of all commonly used algorithms with acceptable performance and unique abilities. It is very well suited for both simple and very ambitious use cases, although the interface is not very convenient for the middle ground. Marpa is indispensable for developing well-thought out languages, and it is a solid choice in both R&D or production projects with all but the most demanding performance requirements.

You can learn more about Marpa at: