Shaving the yak, a grammar parser-generator for modified EBNF, Part 1

· 1630 words · 8 minute read

I decided to shave a yak.

I've been working my way through Bob Nystrom's excellent Crafting Interpreters, and I got as far as chapter 7, in which you write a parser for your grammar. In the chapter, Bob discusses the grammar that we are to implement in the book. I decided, on a lark, that I would write a code generator that takes (something similar to) Bob's grammar spec and spits out serviceable C# code that can parse the grammar.

I realize, this might be a bit of a fool's errand. It won't win any awards, and its kludge is mediocre – I don't dare call my work here a proper kludge, it isn't nearly obtuse or WTF enough to merit that. But this is something I've been enjoying on my nights and weekends, so I thought perhaps someone would enjoy the story of its creation.

This begins a short series on the implementation of my lexer/parser/compiler for a modified EBNF language. This installment discusses the rudimentary lexer that also happens to do a bit of parsing. First, let's discuss the grammar grammar we'll be implementing.

The Grammar Grammar

The grammar below describes itself. I have no idea whether that makes it meta-circular. The structure is pretty simple.

    grammar -> rule+ EOF ;
    rule -> name "->" expression ";" ;
    expression -> concat ( "*" | "+" | "?" )? ;
    concat -> altern ( "+" altern )* ;
    altern -> group ( "|" group )* ;
    group -> "(" base ")" | base ;
    base -> NAME | STRING | TERMINAL | expansion ;
    NAME -> r"[a-zA-Z_][a-zA-Z0-9_]*" ;
    STRING -> r"'[^']*'" ;
    TERMINAL -> r"[A-Z][A-Z0-9_]*" ;

Beginning from the first line.

A "grammar" consists of one or more rules. I've borrowed the postfix "+" from regular expressions.

A "rule" consists of a name, an assignment (I read -> as "is"), and an expression. I opted to end a rule definition with a semicolon because trying to work out the syntax of spaces, where spaces are only syntactically important part of the time, was an utter headache. I should note that what I'm describing is my third draft of this tool. In a previous draft I tried to use spaces to delimit rules and it proved to be an utter headache.

An expression is one of alternation, concatenation, or a group.

  • rule1 rule2 rule3 would be concatenation, where we expect to match rule1, then rule2, and finally rule3.
  • rule1 | rule2 would be alternation, where we expect to match first rule1, and if not, then rule2.
  • ( rule1 ) is a group, which syntactically means "evaluate everything inside the parens before attempting to evaluate the group together with anything outside of it".

After that we have terminals, sometimes called "literals". "Literal" isn't quite right here. For now I'm sticking with the definition "a literal is a thing in code that has little to no abstraction on top of it".

For terminals I've defined two requirements: first, a terminal rule name is ALL CAPS; second, a terminal rule definition is a regular expression. In this vein, the string literals you see sprinkled through the grammar above (such as the parens in "(" base ")" | base ;) require exact matches. Both literals and terminals pass through a regular expression engine, so we could make our literals a lot more complex if we wanted to.
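
To make that concrete, here's a rough illustration (not the tool's actual code) of both kinds of match being handed to Python's re module:

import re

# A TERMINAL rule is already a regular expression.
print(re.match(r"[A-Z][A-Z0-9_]*", "STRING"))   # matches 'STRING'

# A string literal such as "->" goes through the same engine, where it happens
# to match itself exactly (it contains no special regex characters).
print(re.match(r"->", "-> expression ;"))       # matches the leading '->'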

And that's it for the grammar spec! Now let's talk about lexing/tokenizing.

The Grammar Lexer

Defining Tokens

The first thing we need to decide is "what are we trying to find?". A lexer traditionally takes some input (maybe a string, maybe something else) and outputs tokens. A token is what we call "the thing we're trying to find". But what are we trying to find? We specify our token types – i.e., the names of the things we're trying to find – based on the grammar above. While I was describing the grammar, I was treating it as something with underlying structure and meaning. Now we need to treat it as a raw string with no meaning or structure attached. After all, we are building the thing that attaches meaning to some arbitrary input.

We begin with an enum in Python. I tried to be clever and embed all the necessary information into one structure while still dealing with objects, but I was not successful. So our token types are defined in two parts. First, an enum:

from enum import Enum


class TokenType(Enum):
    Name = 1
    String = 2
    Terminal = 3
    Arrow = 4
    Semicolon = 5
    Newline = 6
    LParen = 7
    RParen = 8
    Pipe = 9
    Plus = 10
    Star = 11
    Question = 12
    EOF = 13
    Whitespace = 14
    ForwardSlash = 15

I considered doing what Peter Norvig did in his Scheme-in-Python implementation, where whitespace is the delimiter for everything, but I want concatenation to be more elegant than rule1 + rule2. If we use whitespace as the operator, we get something like rule1 rule2. This is purely stylistic, and as you'll learn, it caused me quite a headache.

We then create a Token class to house various bits of information about our token.

from typing import NamedTuple


class Token(NamedTuple):
    """A token is a named tuple of the token type and the lexeme."""
    token_type: TokenType
    lexeme: str = None
    lineLocation: int = 0
    columnLocation: int = 0

    def __repr__(self):
        return self.__str__()

    def __str__(self):
        return f"Token({self.token_type}, '{self.lexeme}', line: {self.lineLocation}, col: {self.columnLocation})"

This is pretty straightforward, containing mostly error and debug information.
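
Just to see the shape of it, constructing a token by hand and printing it looks like this:

print(Token(TokenType.Name, "expression", 0, 0))
# Token(TokenType.Name, 'expression', line: 0, col: 0)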

You'll see the final bit of information about tokens once we get into the Tokenizer class.

Tokenizer/Lexer

Note on definitions: I consider a token to be a found thing and its related information, and a lexeme to be the "definition" of the thing we found.

__init__

import re


class LexException(Exception):
    """Raised when the tokenizer can't match any lexeme."""


class Tokenizer:
    """Tokenizes a given grammar."""

    def __init__(self, _grammar: str):
        self.grammar = _grammar
        self.line = 0
        self.column = 0
        self.tokens = []

Nothing really to say here, except that an instance of Tokenizer implicitly has a grammar already associated with it. It occurred to me that I could probably build a generator that takes some kind of streaming input and outputs tokens. This would be stateless, and unless we were clever, we would lose the ability to have absolute line and column numbers.
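
Purely as a sketch of that idea (this isn't part of the tool, and it leans on the lexemes table defined just below), a generator-based version might look something like:

import re

def tokenize_stream(lines):
    """Hypothetical sketch: yield (token_type, lexeme) pairs from an iterable of
    lines, without keeping the whole grammar or absolute positions around."""
    for line in lines:
        rest = line
        while rest:
            for pattern, token_type in Tokenizer.lexemes.items():
                match = re.match(pattern, rest)
                if match:
                    yield token_type, match.group(0)
                    rest = rest[match.end():]
                    break
            else:
                raise LexException(f"Could not match lexeme near: {rest[:20]!r}")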

Here's the final bit of lexeme information that ties everything together.

lexemes

This completes our definition of "lexeme". Here we pair a match string with its enum value. All of these strings will be used to match on things within our text.

    lexemes = {

        r"[A-Z][A-Z0-9_]*": TokenType.Terminal,
        r"[a-zA-Z_][a-zA-Z0-9_]*": TokenType.Name,
        r"(\"[^\n\r]*\"|\'[^\n\r]*\')(?<!\\\")": TokenType.String,
        r"->": TokenType.Arrow,
        r";": TokenType.Semicolon,
        r"\(": TokenType.LParen,
        r"\)": TokenType.RParen,
        r"\|": TokenType.Pipe,
        r"\+": TokenType.Plus,
        r"\*": TokenType.Star,
        r"\?": TokenType.Question,
        " ": TokenType.Whitespace,
        "\n": TokenType.Newline,
    }

The order of the lexemes matters. If everything one lexeme can match would also be matched by another, more general lexeme, the more specific one has to come first, because we take the first pattern that matches. Here, the affected lexemes are Terminal and Name: Terminal happens to be all caps, and Name can also be all caps, so we need to check for Terminal first.
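
A quick illustration (again, not the tool's code) of why the order matters:

import re

# Both patterns happily match an all-caps name, so whichever we try first wins.
print(re.match(r"[a-zA-Z_][a-zA-Z0-9_]*", "EOF").group(0))  # 'EOF' -- Name would claim it
print(re.match(r"[A-Z][A-Z0-9_]*", "EOF").group(0))         # 'EOF' -- Terminal, the tag we want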

Now, finally, the meat.

tokenize(self)

    def tokenize(self):
        # Strip whitespace from front and back of grammar.
        self.grammar = self.grammar.strip()
        # Strip out extra spaces.
        self.grammar = re.sub('[ \r\t]+', ' ', self.grammar)

The first thing we do is strip any whitespace from the front and back of the text, then collapse runs of spaces, tabs, and carriage returns into single spaces. Note that we don't strip out newlines. We'll be turning those into tokens.
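
For example, that normalization turns a messy rule into a single-spaced one:

import re

raw = "  rule ->\texpression ;  "
print(re.sub('[ \r\t]+', ' ', raw.strip()))  # 'rule -> expression ;'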

Next, we tokenize. :)

        # Tokenize the grammar.
        _grammar = self.grammar

        while _grammar:

            token = None
            for lexeme, token_type in self.lexemes.items():
                match = re.match(lexeme, _grammar)
                if match:
                    token = Token(token_type, match.group(0), self.line, self.column)
                    _grammar = _grammar[match.end():]

                    # Check if we've entered a new line.
                    if token.token_type == TokenType.Newline:
                        self.line += 1

                        # Reset the column number when we're done with a line.
                        self.column = 0

                    # We strip off the front of the list after every match, so we use the previous match end as our
                    # index.
                    self.column += match.end()
                    break
            else:
                raise LexException("Could not match lexeme:\n" +
                                self.grammar.split('\n')[self.line] + "\n" +
                                " " * self.column + "^" + "\n")

            self.tokens.append(token)

        self.tokens.append(Token(TokenType.EOF, None, self.line, self.column))

        return self.tokens

The core of the functionality is the while loop. While we have anything left in _grammar, keep trying to do stuff.

Then we loop through each lexeme we could expect; for lexeme, token_type in self.lexemes.items() unpacks our lexeme definition. I think there is a more elegant way to do this, but this brute force method works. Unfortunately, that makes our time complexity for generating tokens m x n, where m is the number of lexemes and n is the number of tokens in the input. Improving on that would necessitate bringing parsing into our lexer in order to intelligently search for tokens, which is no bueno.

We run the lexeme through a regex match and check if we found a match. If we did, a few important things happen: first, we generate a token with the appropriate bits of information; second, we strip off the token we found with _grammar = _grammar[match.end():]. This is how we move forward in this implementation. When we make it to a point where _grammar is empty, i.e. "", then we're done.
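
A tiny illustration of that slicing, outside the tokenizer:

import re

text = "-> expression ;"
match = re.match(r"->", text)
print(text[match.end():])  # ' expression ;' -- the arrow has been consumed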

If we found a newline, we increment self.line and reset self.column to zero. In every case we then advance self.column by the length of the match.

Finally, we add our found token to our list of self.tokens.

Once _grammar is empty, we add an EOF token and return all the tokens to the caller.
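
Putting it all together, running the tokenizer over a one-rule grammar produces something like this (note that Whitespace tokens come through too, since nothing filters them out yet):

for token in Tokenizer("rule -> NAME ;").tokenize():
    print(token)
# Token(TokenType.Name, 'rule', line: 0, col: 0)
# Token(TokenType.Whitespace, ' ', line: 0, col: 4)
# Token(TokenType.Arrow, '->', line: 0, col: 5)
# ...and so on, ending with the EOF token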

Fin

Up next is the parser. There were three drafts of, and variations on, the grammar we discussed, though I'm not sure I want to talk about the dead ends. Anyhow, I hope you enjoyed this somewhat theatrical tour of some very simple code.