Translation Part 2 via Abstract Syntax Tree

Posted by Aaron Feng Fri, 27 Apr 2007 02:27:00 GMT

Last post I created a simple translator by using ANTLR V3 that translated C# (or Java) like syntax into PL/SQL. The translation relied heavily on String Template and token rewrite. It generated the following PL/SQL statements from a single class declaration:

  • Drop table
  • Drop sequence
  • Create table
  • Create sequence
  • Create Trigger (sequence and trigger are for auto increment id column)

Notice there was not a one-to-one relationship between the input and the output. Output like this is ideal for template based translation, but not when you are constructing an Abstract Syntax Tree (AST). The reason is the tree nodes will have to be duplicated in order to render multiple items from a single input.

In this example, my language will stay the same, but output will only contain the creation of table. In the future post, I will change the language into something simpler, and the ability to output all the previous items (drop table, create auto-incremented column) separately. In this example I only want to focus on the basics on creating AST by keeping everything else as simple as possible.

Let's take look at the new parser and lexer grammar.

// CSharpSQL.g

grammar CSharpSQL;
options {
 output = AST;
 ASTLabelType = CommonTree;
}

tokens {
 CLASSDEF;
 VARDEF;
}

// parser
program : (declaration { System.out.println($declaration.tree.toStringTree()); } ) + ;

declaration : class_statement '{' (variable_statement)* '}'
                -> ^(class_statement variable_statement*) ;

class_statement : scope_modifier 'class' ID
                   -> ^(CLASSDEF ID) ;

variable_statement : scope_modifier type ID  ';'
                      -> ^(VARDEF type ID) ;

scope_modifier : 'public' ;

// more can be added
type : 'string'
     | 'int'
     | 'decimal'
     | 'DateTime' ;

// lexer
ID  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_') * ;

WS  :   ( ' ' | '\t' | '\r' | '\n' )+ { $channel = HIDDEN; } ;

The first thing to notice is the tokens declaration before the program rule. Since the the output will be a tree, tokens declaration defines a set of virtual tokens that can be used as tree nodes. Virtual token help the tree parser to understand the relationship between nodes. In this example there are two virtual tokens, CLASSDEF and VARDEF. We need a way to distinguish the difference between variable name and the name of the class since we cannot tell by the token itself. Class name will have CLASSDEF as its root node, and variable name will have VARDEF as its root node.

Each rule returns a subtree by "rewriting" the input using the -> notation. Everything on the right hand side of -> encodes the hierarchy of the subtree. The token closest to the ^ symbol is the root node, and rest within the parentheses are the children of the root. Notice that we only keep the meaningful nodes in the tree, and discard any unnecessary input by "rewriting" them. Some rules do not return a subtree, but simply delegate the other rules to return the subtree.

Now we need a tree walker to visit each node in the tree so it can render the final output. The tree walker grammar is shown below:

// Translate.g
tree grammar Translate;

options {
 tokenVocab = CSharpSQL;
 ASTLabelType = CommonTree;
}

@members {
 String className;
 List<String> columns = new ArrayList<String>();
}

program : (declaration
          {
           String table = "CREATE TABLE " + className + '\n' + "(" + '\n';
           String seperator = ",";
           Object[] arrayColumns = columns.toArray();
           for(int i = 0; i < arrayColumns.length; i++) {
            if(i == arrayColumns.length - 1) seperator = "";
            table += " " + arrayColumns[i].toString() + seperator + '\n';
           }
           table += ")";
           System.out.println(table);
           columns.clear();
          } ) + ;

declaration : ^(CLASSDEF ID variable_statement*) { className = $ID.text; } ;

variable_statement : ^(VARDEF type ID)
                     {
                      columns.add($ID.text + " " + $type.value + " NOT NULL");
                     } ;

type returns [String value]
    : 'string' { $value = "nvarchar(255)"; }
    | 'int'   { $value = "integer"; }
    | 'decimal' { $value = "number(21,6)"; }
    | 'DateTime'  { $value = "date"; };

Here is where all the translation happens. Once again we have a couple of member variables declared in @members to help collect all the required fields for the translation. The tree grammar rule is very similar to parser or lexer rule. Since the input is a tree, you need to define rules that will match the nodes in the tree. You construct the rules by specifying the relationships between nodes using the same syntax parser used to create the tree. We know whenever the tree walker sees the CLASSDEF one of the child nodes (ID) is the class name. CLASSDEF can have zero or more variable_statement which means it can have zero or more VARDEF as its children. Each VARDEF contains exactly two children nodes, a type and ID which is used to construct column definition.

Let's run the recognizer using the following test program:

// Test.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import java.io.*;

public class Test {
 public static void main(String[] args) throws Exception {
  ANTLRInputStream input = new ANTLRInputStream(System.in);
  CSharpSQLLexer lexer = new CSharpSQLLexer(input);
  TokenRewriteStream tokens = new TokenRewriteStream(lexer);

  CSharpSQLParser parser = new CSharpSQLParser(tokens);
  CSharpSQLParser.program_return r = parser.program();

  CommonTree tree = (CommonTree)r.getTree();
  CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
  Translate walker = new Translate(nodes);
  walker.program();
 }
}

Notice that the output of the parser is passed into the Translate tree walker for the translation.

Comments

Leave a response

Comments