|
|||||
|
|||||
jparsec2 Tutorial
A Tutorial for jparsec FrameworkIn this tutorial, we will use the classical example: calculator, to show how jparsec's declarative API can be used to construct a parser. Our calculator should be able to:
A typical parsing process takes 2 steps:
TokenizerFirst, let's create our tokenizer. The tokenizer should be able to recognize decimal number, parenthesis and the operators. Input character sequence will be recognized and translated to token. Number literalThe prebuilt Terminals.DecimalLiteral can be used to tokenize decimal literals. OperatorsAn instance of Terminals can be created to manage our operators: private static final Terminals OPERATORS = Terminals.operators("+","-","*","/", "(", ")"); WhitespacesThis calculator allows any number of white spaces before and after any number or operator. static final Parser<Void> IGNORED = Parsers.or(Scanners.JAVA_LINE_COMMENT, Scanners.JAVA_BLOCK_COMMENT, Scanners.WHITESPACES).skipMany(); Put them togetherWith the number literal tokenizer and the OPERATORS constant, our tokenizer will be: static final Parser<?> TOKENIZER = Terminals.DecimalLiteral.TOKENIZER.or(OPERATORS.tokenizer()); And given any syntactical parser, we will be able to chain it with the tokenizer and ignore the whitespaces and comments: Parser<T> grammar = ...; Parser<T> parser = grammar.from(TOKENIZER, IGNORED); Parser and InterpreterWith the tokenizer taking care of lexical analysis, we can now focus on grammar rules. Decimal NumberFrom the tokenizer, we could have two different kinds of tokens: decimal numbers and operators. To convert the decimal number token to a Double object, we use the corresponding syntactic parser for decimal literal and transform the string result: static final Parser<Double> NUMBER = Terminals.DecimalLiteral.PARSER.map(new Map<String, Double>() { public Double map(String s){ return Double.valueOf(s); } }); Modeling OperatorsBefore proceeding to parsing the operators, let's first model them using enum. The binary operators (+, -, *, /) implement Binary to facilitate calculation; and negative operator (-) implement Unary. We will later on declare them in a OperatorTable. enum BinaryOperator implements Binary<Double> { PLUS { public Double map(Double a, Double b) { return a + b; } }, MINUS { public Double map(Double a, Double b) { return a - b; } }, MUL { public Double map(Double a, Double b) { return a * b; } }, DIV { public Double map(Double a, Double b) { return a / b; } } } enum UnaryOperator implements Unary<Double> { NEG { public Double map(Double n) { return -n; } } } Parsing OperatorsAll the operators are managed by the OPERATORS constant. The Terminals.token() method returns a parser that recognizes operator. A convenience method can be created on top of it to save a few keystrokes for us: static Parser<?> term(String... names) { return OPERATORS.token(names); } For each operator, the parser will first recognize the token using the term() method and then return the corresponding BinaryOperator/UnaryOperator instance, as in: static <T> Parser<T> op(String name, T value) { return term(name).retn(value); } Whitespace OperatorAt the beginning, we said we want to allow whitespace to be an alternative syntax for multiplication. Well, formally speaking, whitespace itself is not enough to make an operator. For example, "2 -3" should still be parsed as a "minus", not a multiplication between "2" and "-3". Therefore, our whitespace operator should only happen when none of "+", "-", "*", "/" is present. Therefore we have: static final Parser<BinaryOperator> WHITESPACE_MUL = term("+","-","*","/").not().retn(BinaryOperator.MUL); Operator PrecedenceIn a classic recursive-desent parser, operators with different precedence are handled by defining different production rules, for example: term_expr ::= ...
muldiv_expr ::= muldiv_expr ('*'|'/') term_expr
expr = expr ('+'|'-') muldiv_expr
This solution can lead to messy production rules when the number of operators and precedence levels scale up. It is more desirable to be able to specify the operator precedence declaratively. Left RecursionAnother drawback of recursive-desent parser is left recursion. In order to preserve the left-associative nature of the operators, the above production rules, are defined left-resursively (the first rule at the right hand side of muldiv_expr is muldiv_expr itself). Such left-recursive definition will lead to infinite-recursion in a naive recursive-desent parser implementation. OperatorTableJparsec provides support for operator precedence grammar and automatically handles left recursion incured by left-associative operators. Our target syntax should look like: Parser<Doubel> unit = ...; Parser<Double> parser = new OperatorTable<Double>() .infixl(op("+", BinaryOperator.PLUS), 10) .infixl(op("-", BinaryOperator.MINUS), 10) .infixl(op("*", BinaryOperator.MUL).or(WHITESPACE_MUL), 20) .infixl(op("/", BinaryOperator.DIV), 20) .prefix(op("-", UnaryOperator.NEG), 30) .build(unit); The higher the precedence number, the more tightly it binds the operands. Parens and RecursionIn order to support parens, recursion needs to be involved. An expression inside a pair of parens is itself another expression, which can be nested inside another pair of parens. The way to handle recursion is to use Parser.Reference: Parser.Reference<Double> ref = Parser.newReference(); Parser<Double> parenthesized = ref.lazy().between(term("("), term(")")); If you run the "parenthesized" parser above, it will fail with some error message about the reference is not set yet. And that is accurately what we are missing here. The reference needs to finally be set with the parser for expression, because, well, literally any expression can be nested within a pair of parens. So, putting it together with the operator table and whitespace operator, we have: static Parser<Double> calculator(Parser<Double> atom) { Parser.Reference<Double> ref = Parser.newReference(); Parser<Double> unit = ref.lazy().between(term("("), term(")")).or(atom); Parser<Double> parser = new OperatorTable<Double>() .infixl(op("+", BinaryOperator.PLUS), 10) .infixl(op("-", BinaryOperator.MINUS), 10) .infixl(op("*", BinaryOperator.MUL).or(WHITESPACE_MUL), 20) .infixl(op("/", BinaryOperator.DIV), 20) .prefix(op("-", UnaryOperator.NEG), 30) .build(unit); ref.set(parser); // DO NOT FORGET THIS! return parser; } And that is almost everything we need. Oh, wait, let's pass the NUMBER parser in and hook it up with the lexer to get the final parser: static final Parser<Double> CALCULATOR = calculator(NUMBER).from(TOKENIZER, IGNORED); And it doesn't hurt to run it and see how it happily calculates, does it? System.out.println(CALCULATOR.parse("1 + 2 (4 - 3) /*comment*/")); The EndThe final parser code is as following: public class Calculator { enum BinaryOperator implements Binary<Double> { PLUS { public Double map(Double a, Double b) { return a + b; } }, MINUS { public Double map(Double a, Double b) { return a - b; } }, MUL { public Double map(Double a, Double b) { return a * b; } }, DIV { public Double map(Double a, Double b) { return a / b; } } } enum UnaryOperator implements Unary<Double> { NEG { public Double map(Double n) { return -n; } } } static final Parser<Double> NUMBER = Terminals.DecimalLiteral.PARSER.map(new Map<String, Double>() { public Double map(String s) { return Double.valueOf(s); } }); private static final Terminals OPERATORS = Terminals.operators("+", "-", "*", "/", "(", ")"); static final Parser<Void> IGNORED = Parsers.or(Scanners.JAVA_LINE_COMMENT, Scanners.JAVA_BLOCK_COMMENT, Scanners.WHITESPACES).skipMany(); static final Parser<?> TOKENIZER = Parsers.or(Terminals.DecimalLiteral.TOKENIZER, OPERATORS.tokenizer()); static Parser<?> term(String... names) { return OPERATORS.token(names); } static final Parser<BinaryOperator> WHITESPACE_MUL = term("+", "-", "*", "/").not().retn(BinaryOperator.MUL); static <T> Parser<T> op(String name, T value) { return term(name).retn(value); } static Parser<Double> calculator(Parser<Double> atom) { Parser.Reference<Double> ref = Parser.newReference(); Parser<Double> unit = ref.lazy().between(term("("), term(")")).or(atom); Parser<Double> parser = new OperatorTable<Double>() .infixl(op("+", BinaryOperator.PLUS), 10) .infixl(op("-", BinaryOperator.MINUS), 10) .infixl(op("*", BinaryOperator.MUL).or(WHITESPACE_MUL), 20) .infixl(op("/", BinaryOperator.DIV), 20) .prefix(op("-", UnaryOperator.NEG), 30) .build(unit); ref.set(parser); return parser; } public static final Parser<Double> CALCULATOR = calculator(NUMBER).from(TOKENIZER, IGNORED); } SummaryIn this tutorial, we chose to run the calculation directly in the semantic actions. We could defer this calculation and create intermediary abstract syntax tree for further analysis and processing (such as type checking and code optimisation). But that makes no real difference in the parser code. We just need to replace the direct calculation code with the code that creates the abstract syntax tree (of course, we also need to replace <Double> with <whatever your expression type>). This short tutorial cannot cover every detail of the jparsec API. There are many other useful utility methods provided to help the process of constructing a complex parser. Please refer to the jparsec API Documentation for further information. |
|||||
|
Copyright 2003-2006 - The Codehaus. All rights reserved unless otherwise noted.
Powered by Atlassian Confluence
|
|||||