Thursday, February 14, 2013

Tree Grammars - Abstract Syntax Trees in ANTLR

This tutorial shows how to create an abstract syntax tree (AST) from a combined grammar in ANTLR. Let's start with a simple grammar in ANTLR.

grammar someGrammar;

options {
  language = Java;
  output = AST;
}


program

    :
        variableDeclaration+
    ;
   
variableDeclaration

    :
        type ID ';'

    ;

type
    :
        'int' | 'char'
    ;

ID
    :
        ('a'..'z' | 'A'..'Z' | '_') ('0'..'9' | 'a'..'z' | 'A'..'Z' | '_')*
    ;

WS : ('\t' | '\r\' | '\n' | ' ')+ { $channel = HIDDEN; } ;


The first step in creating an AST is to make sure that we have output = AST; defined in the options section of your grammar. However, this alone, will not create an AST. Alone, this will create a parse tree, which is essentially a general form of an AST. A parse tree will consist of everything in our source code, including noise that doesn't provide any semantic meaning to your language, such as semicolons that terminate statements. Moreover, the parse tree will be a flat tree (i.e. just a flat list of tree nodes). What we'd like, however, is a tree with depth and structure, like this.

 

To remove the noise that's inherent in a parse tree, give it depth/structure, and generate an AST that only includes the relevant information we need, the easiest method is to add rewrite rules to our grammar. These rewrite rules tell ANTLR how we want our AST to be structured. Here's how we would add a rewrite rule to the variableDeclaration definition to tell ANTLR to restructure variableDeclaration to a more useful AST node. Note that I've bolded the two changes that I made to the grammar. I added a tokens section, in addition to a rewrite rule to the variableDeclaration.

grammar someGrammar;

options {
  language = Java;
  output = AST;
}


tokens {
  VAR;
}

program

    :
        variableDeclaration+
    ;

variableDeclaration
    :
        type ID ';'

        ->   ^(VAR type ID)
    ;

type
    :
        'int' | 'char'
    ;

ID
    :
        ('a'..'z' | 'A'..'Z' | '_') ('0'..'9' | 'a'..'z' | 'A'..'Z' | '_')*
    ;

WS : ('\t' | '\r\' | '\n' | ' ')+ { $channel = HIDDEN; } ;



The tokens section defines an imaginary token/node called VAR that we later use in the rewrite rule. This was created because we want to know that it's a variable declaration; however, there is no VAR keyword in the grammar. The rewrite rule is defined as ^(VAR type ID). Note that the terminating semicolon isn't part of this rule. That's because we don't want the semicolon in our AST. The semicolon is important to the syntax of our language to terminate statements, but once we've parsed the input source code, we don't need the semicolon, because it doesn't provide any semantic meaning.

In summary, an AST is a tree with depth and structure, that eliminates anything from the input source code that doesn't provide semantic meaning. In addition, it can (as in this example), add imaginary tokens to help add necessary structure to support the semantic meaning of the source code. In a later tutorial, I'll demonstrate how to write a tree walker to process the resulting AST.



No comments:

Post a Comment