m//gc Style Lexing With Perl

or: Five Parsing Techniques You Can't Do With s/// Substitutions

So you're writing a simple lexer or parser in Perl? You'll probably use regexes. Here's how to use the little-known pos() function to apply those regexes correctly.

Let's assume you have a string and are trying to tokenize it. For example, let's take this Lisp-like syntax:

use Test::More tests => 1;

my $input = '(foo bar (baz qux) quux)';
my $expected = [
  [PAREN_L => undef],
  [IDENT => 'foo'],
  [IDENT => 'bar'],
  [PAREN_L => undef],
  [IDENT => 'baz'],
  [IDENT => 'qux'],
  [PAREN_R => undef],
  [IDENT => 'quux'],
  [PAREN_R => undef],
];

is_deeply tokenize($input), $expected;

One way is to use s/^token-pattern//-substitutions to chop off the current token from the beginning of the string. Perl actually implements this fairly efficiently, but it destroys the original input and therefore requires us to operate on a copy.

Here's how we could implement that tokenizer with substitutions:

sub tokenize {
  my ($input) = @_;  # note the copy
  my @tokens;
  while (length $input) {
    if ($input =~ s/\A\s+//) {
      # skip
    }
    elsif ($input =~ s/\A[(]//) {
      push @tokens, [PAREN_L => undef];
    }
    elsif ($input =~ s/\A[)]//) {
      push @tokens, [PAREN_R => undef];
    }
    elsif ($input =~ s/\A([^()\s]+)//) {
      push @tokens, [IDENT => $1];
    }
    else {
      my $context = substr $input, 0, 20;
      die qq(Unknown token at "$context");
    }
  }
  return \@tokens;
}

With m//gc-style lexers, we do not need a copy, and can later put more advanced features into our tokenizer. The central concept of this technique is that in scalar context, a regex with the /g modifier does not match all occurrences, but instead continues matching from the location in the string where the last match left off. This is possible since every string records the last match position, which is gettable and settable through the pos builtin function.

However, the pos() property is reset to undef when a match fails. This behaviour can be suppressed with the /c flag, which allows us to try another possible token pattern at that position.
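
A quick demonstration of this reset behaviour (matching in scalar context, as in the example below):

my $str = "foobar";
scalar($str =~ /oo/g);  # matches, pos($str) is now 3
scalar($str =~ /XX/g);  # fails, pos($str) is reset to undef
scalar($str =~ /oo/g);  # matches from the start again, pos($str) is 3
scalar($str =~ /XX/gc); # fails, but /c keeps pos($str) at 3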

Finally, the \G assertion inside a regex can be used to anchor at the current pos location. Without it, the regex would start searching at the current pos, but it would be allowed to skip input until a match is found:

my $input = "foobar";
scalar($input =~ /oo/gc); # matches, "f" is skipped
scalar($input =~ /ar/gc); # matches as well, "b" is skipped
pos($input) = 0; # reset the location
scalar($input =~ /oo/gc); # matches, "f" is skipped
scalar($input =~ /\Gar/gc); # fails, since there's a "b" at current pos

So to summarize, you can think of \G as an \A assertion (beginning of input) or a ^ assertion (beginning of input, or beginning of a line under /m), except that it relates to pos rather than to physical boundaries in the string.

We can now rewrite the above substitution-based tokenizer using m//gc:

sub tokenize {
  # for-loops create an alias rather than a copy
  for my $input (shift @_) {
    my @tokens;
    pos($input) = 0; # set the position to the start – note that this mutates the original string!
    while (pos $input < length $input) {
      if ($input =~ m/\G\s+/gc) {
        # skip
      }
      elsif ($input =~ m/\G[(]/gc) {
        push @tokens, [PAREN_L => undef];
      }
      elsif ($input =~ m/\G[)]/gc) {
        push @tokens, [PAREN_R => undef];
      }
      elsif ($input =~ m/\G([^()\s]+)/gc) {
        push @tokens, [IDENT => $1];
      }
      else {
        my $context = substr $input, pos($input), 20;  # context at the error position, not at the start
        die qq(Unknown token at "$context");
      }
    }
    return \@tokens;
  }
}

Instead of using for-loops to alias the original string, you could also use one of the following techniques:

  • $_[0] instead of $input, as all elements in @_ are aliases (see the demonstration after this list)
  • local *input = \$_[0], which uses a global variable $input. Don't do this.
  • use Data::Alias; alias my $input => $_[0], which is the best thing you could do
  • use Util::Underscore; _::alias my $input => $_[0], which just delegates to Data::Alias.
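
The first option works because @_ aliases the caller's arguments, so even the pos of the caller's string can be manipulated through $_[0]:

sub set_pos { pos($_[0]) = 3 }  # $_[0] is an alias, not a copy

my $string = "foobar";
set_pos($string);
print pos($string), "\n";  # prints 3, because the caller's string was affected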

Parsing techniques that are not possible with substitutions

Now that we don't destroy the string once we find a token, we can implement various advanced techniques:

Better error reports

When an error occurs, the above tokenizers print out the next few characters at the error location. Since the m//gc technique doesn't destroy the input, we can now print out the whole line where the error occurs. The only difficulty is locating the beginning of that line, which is most sensibly done by introducing a bit of line bookkeeping.

sub tokenize {
  for (shift @_) {
    pos = 0;
    my @tokens;
    my $line_no = 1;
    my $line_start = 0;  # the starting position of the current line
    while (pos() < length) {
      my $span_start = pos;  # the starting position of this token
      if (m/\G\s+/gc) {
        # skip
      }
      elsif (m/\G\+/gc) {
        push @tokens, 'PLUS';
      }
      elsif (m/\G\-/gc) {
        push @tokens, 'MINUS';
      }
      else {
        my $column = pos() - $line_start; # 0-based indexing
        my $error = sprintf "%d:%d: Unknown token\n", $line_no, $column + 1;
        # extract the current line and throw an error
        pos = $line_start;
        if (/\G^(.*?)$/gsm) {
          my $line = $1;
          die $error . "\n"
            . $line . "\n"
            . (" " x $column) . "^--\n";  # arrow points to first error char
        }
        else {
          die $error;
        }
      }

      # extract the span of this token
      my $span_end = pos;
      my $span = substr $_, $span_start, ($span_end - $span_start);
      # count newlines within the span
      while ($span =~ m/\R/g) {
        $line_no++;
        $line_start = $span_start + pos($span);
      }

      # continue with next token
    }
    return \@tokens;
  }
}

The above tokenizer handles strings such as ++- +. If we give it some invalid input with letters like this:

+ -
+-foo
- +

Then we are greeted by a nice error such as

2:3: Unknown token

+-foo
  ^--

Longest-token matching

It is frequently useful not to hardcode the order of all tokens in an explicit if-else cascade. Instead, we can put a couple of regexes into a dispatch table and let our program figure out the best match, with the best match usually being the longest one (e.g. fortress is read as the single identifier fortress, and not as the keyword for followed by the identifier tress).

The main advantage of this is that we can easily add new tokens and can get rid of most of the token-handling logic. In a nutshell: we attempt to match all possible tokens, but only keep the token that consumes the most input.

For easier implementation, we assume that the token action coderefs have no side effects, so that the order of invocation and number of invocations do not matter.

sub tokenize {
  my $rules = shift @_; # arrayref of regexp-coderef pairs
  for (shift @_) {
    pos = 0;
    my @tokens;
    while (pos() < length) {
      my $pos = pos;
      my $max_len = 0;
      my $current_longest_token;
      my $any_token_matched = 0;

      for my $rule (@$rules) {
        my ($regex, $action) = @$rule;
        pos = $pos; # reset pos before each match
        next if not m/\G$regex/gc;  # skip tokens that don't match
        # careful: The above match may introduce captures $1, $2, etc. that
        # would be overwritten by the next regex match. Therefore, no further
        # matches may occur before the $action is invoked.
        my $this_len = pos() - $pos;
        next if $this_len < $max_len;  # skip tokens that are too short
        if ($this_len == $max_len) {
          die "Conflicting longest tokens found";
        }
        # this is a new longest token – evaluate the action and update data
        $max_len = $this_len;
        $current_longest_token = $action->();
        $any_token_matched = 1;
      }
      if (not $any_token_matched) {
        die "No token matched";
      }
      # allow "undef" tokens to be ignored
      if (defined $current_longest_token) {
        push @tokens, $current_longest_token;
      }
      pos = $pos + $max_len;  # set pos to the start of the next token
    }
    return \@tokens;
  }
}

Example usage:

use Test::More tests => 1;

my $input = '(foo bar (baz qux) quux)';
my $expected = [
  [PAREN_L => undef],
  [IDENT => 'foo'],
  [IDENT => 'bar'],
  [PAREN_L => undef],
  [IDENT => 'baz'],
  [IDENT => 'qux'],
  [PAREN_R => undef],
  [IDENT => 'quux'],
  [PAREN_R => undef],
];

my @rules = (
  [qr/\s+/ => sub { undef }],
  [qr/[(]/ => sub { [PAREN_L => undef] }],
  [qr/[)]/ => sub { [PAREN_R => undef] }],
  [qr/([^()\s]+)/ => sub { [IDENT => $1] }],
);

is_deeply tokenize(\@rules, $input), $expected;

It may be interesting to note that the majority of rule actions will be fairly boring and could be autogenerated according to some convention.
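
For example, a hypothetical make_rules helper could generate the actions from name/pattern pairs, assuming the convention that the first capture group (if any) becomes the token value:

sub make_rules {
  my @rules;
  while (my ($name, $pattern) = splice @_, 0, 2) {
    # each action closes over its own $name; $#+ tells us whether the
    # just-matched pattern had a capture group, so a stale $1 from an
    # earlier match cannot leak into the token value
    push @rules, [$pattern => sub { [$name => $#+ >= 1 ? $1 : undef] }];
  }
  return @rules;
}

my @rules = (
  [qr/\s+/ => sub { undef }],  # whitespace stays special-cased
  make_rules(
    PAREN_L => qr/[(]/,
    PAREN_R => qr/[)]/,
    IDENT   => qr/([^()\s]+)/,
  ),
);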

Longest acceptable token matching

As a variation of longest-token matching, the tokenizer can be driven by a parser that immediately consumes each token. In such a setup, the parser may know which tokens are possible at the current position, and can ask the tokenizer for one of these tokens. This technique works well with the Marpa parser, since that parser is completely aware of the current context during the parse.

Instead of returning a list of tokens, we return a callback that takes a list of possible token names and returns a matching token:

sub make_tokenizer {
  my $rules = shift @_;
  my $input_ref = \shift @_;
  # preprocess the rules so that we can look up applicable tokens by name
  my %rules_by_name;
  for my $rule (@$rules) {
    my ($name, $regex, $action) = @$rule;
    push @{ $rules_by_name{$name} }, [$regex, $action];
  }

  return sub {
    my @acceptable_token_names = @_;
    my @rules = map { @{ $rules_by_name{$_} // [] } } @acceptable_token_names, ' ';  # unknown names contribute no rules
    for ($$input_ref) {
      my $token;
      ... # the rest is ordinary longest-token-matching
      if (not defined $token) {
        die "No token found, expected @acceptable_token_names";
      }
      return $token;
    }
  };
}

The above implementation allows the same token to be provided by multiple rules, which is convenient when you have both single- and double-quoted strings that require different preprocessing but boil down to the same token type. The blank name marks rules that are always acceptable and that are not explicitly listed in the grammar. This is useful for white space rules if whitespace may occur between any tokens (as is the case in C). To avoid clashes, these rules are tested last.
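
For completeness, here is a minimal sketch of how the elided part could be fleshed out, assuming that rules whose action returns undef (such as whitespace) should be skipped silently, and glossing over details such as end-of-input handling:

  return sub {
    my @acceptable_token_names = @_;
    my @rules = map { @{ $rules_by_name{$_} // [] } } @acceptable_token_names, ' ';
    for ($$input_ref) {
      my $token;
      # keep matching until an action produces a defined token,
      # no rule matches, or the input is exhausted
      while (not defined $token and (pos() // 0) < length) {
        my $pos = pos() // 0;
        my ($max_len, $matched) = (0, 0);
        for my $rule (@rules) {
          my ($regex, $action) = @$rule;
          pos = $pos; # reset pos before each match
          next if not m/\G$regex/gc;
          my $this_len = pos() - $pos;
          next if $this_len < $max_len;
          die "Conflicting longest tokens found" if $matched and $this_len == $max_len;
          ($max_len, $matched) = ($this_len, 1);
          $token = $action->();
        }
        last if not $matched;
        pos = $pos + $max_len;  # set pos to the start of the next token
      }
      if (not defined $token) {
        die "No token found, expected @acceptable_token_names";
      }
      return $token;
    }
  };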

The format of the input rules has changed accordingly, and now looks like this:

my @rules = (
  [' '     => qr/\s+/ => sub { undef }],
  [PAREN_L => qr/[(]/ => sub { [PAREN_L => undef] }],
  [PAREN_R => qr/[)]/ => sub { [PAREN_R => undef] }],
  [IDENT   => qr/([^()\s]+)/ => sub { [IDENT => $1] }],
);

The parser then uses this tokenizer along these lines:

my $tokenize = make_tokenizer(\@rules, $input);
...
my $token = $tokenize->(qw/IDENT PAREN_R/);
do_something_with($token);

Ambiguous tokens

Sometimes there may be multiple longest tokens. Ordinarily, these would either be ignored or would throw an exception. Some parsers can deal with such ambiguity, in which case returning all longest tokens is necessary. For example, a language might have for as both a keyword and a variable. Perl itself is such a language: you can define a method named for even though it's a keyword:

use feature 'say';

sub for {
  my ($self) = @_;
  say "inside method for";
}

for (1..3) {  # "for" as keyword
  main->for();  # "for" as method
}

In practice, this would be more easily resolved by Longest Acceptable Token Matching, but without such context-awareness, the ambiguity has to be embraced at the tokenizer stage.

The implementation is fairly trivial – you maintain a list of longest tokens for this position, and return the whole list at the end. Whenever a longer token comes along, the list is cleared.
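
A sketch of such a variant, reusing the rule format of the longest-token matcher above (the hypothetical tokenize_ambiguous returns, for each position, the list of all equally long interpretations):

sub tokenize_ambiguous {
  my $rules = shift @_;
  for (shift @_) {
    pos = 0;
    my @tokens;
    while (pos() < length) {
      my $pos = pos;
      my $max_len = 0;
      my @longest_tokens;
      for my $rule (@$rules) {
        my ($regex, $action) = @$rule;
        pos = $pos; # reset pos before each match
        next if not m/\G$regex/gc;
        my $this_len = pos() - $pos;
        next if $this_len < $max_len;
        @longest_tokens = () if $this_len > $max_len;  # a longer token clears the list
        $max_len = $this_len;
        push @longest_tokens, $action->();
      }
      die "No token matched" if not @longest_tokens;
      my @alternatives = grep { defined } @longest_tokens;
      push @tokens, \@alternatives if @alternatives;  # one list of alternatives per span
      pos = $pos + $max_len;  # set pos to the start of the next token
    }
    return \@tokens;
  }
}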

Context-sensitive tokens

One main advantage of not destroying your input as you go is that you can use lookaround assertions in your regexes, or other mechanisms to inspect a token's surroundings. For example, you can implement a token that only matches at the beginning of a line: m/\G^comment/mgc. If the preceding input had been chopped off, the current match position would always be at the beginning of the string, so such a test would always succeed and thus be meaningless. Unfortunately, the Perl regex engine has fairly restricted lookaround assertions, in that lookbehinds must use a constant-width pattern.
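
To illustrate both ideas, here is a small, self-contained sketch with made-up token kinds: a line comment that only counts when it starts a line, plus a constant-width lookbehind:

my $input = "# heading\nfoo # trailing hash\n";
pos($input) = 0;
while (pos($input) < length $input) {
  if ($input =~ m/\G^#[^\n]*/mgc) {
    print "comment at line start\n";
  }
  elsif ($input =~ m/\G(?<=\s)#/gc) {  # constant-width lookbehind
    print "hash preceded by whitespace\n";
  }
  elsif ($input =~ m/\G./sgc) {
    # skip any other character
  }
}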

Conclusion

Matching regexes with m//gc and pos is a must for more advanced parsing techniques, or simply for better error messages.