|
|||||
|
|||||
NParsec Tutorial
A Tutorial for NParsec FrameworkIn this tutorial, we will use the classical example: calculator, to show how NParsec's declarative API can be used to create a parser. NParsec can be downloaded here. The tutorial will use C# 2.0 syntax (generics and anonymous delegates) since NParsec is built exclusively for C# 2.0 or higher. Our calculator should be able to:
A typical parsing process takes 2 steps:
Import relavant NParsec classesThe following code imports all relevant NParsec types.
LexerFirst, let's create our lexer. The lexer should be able to recognize decimal number, parenthesis and the operators. Input character sequence will be recognized and translated to tokens. To recognize a decimal number, we could use regular expression. NParsec has support for regular expression-based scanner. We will use straight NParsec API, however, to present a more intuitive way. Decimal Number ScannerThe Parsec.Pattern.Pattern and Parsec.Pattern.Patterns are two useful classes to represent any sophisticated pattern of input characters. The following code represents a pattern of integer:
A decimal number is 1 or more digits optionally followed by a '.' and 1 or more digits as the fractional part:
Finally, we use this Pattern object to create our a Scanner object:
In fact, NParsec does have built-in support for frequently used lexers such as integer, string literal, decimal number etc. They can be found in the Lexers class. To better serve the "tutorial" purpose, we chose to write the lexers from scratch instead of using these pre-built functions. Decimal Number LexerA Scanner object only matches input. It knows the position and the length of the match, but it doesn't return any token. We need to use a Parsec.Tokenizer object to convert the matched sub-sequence to a token object:
Lexer for OperatorsTo create lexers for the operators, the helper class Terms can be used:
WhitespacesNow, we are done with all the necessary tokens of the calculator. The only thing left is the white spaces. This calculator allows any number of white spaces before and after any number or operator.
The calculator will ignore any one of the above 3, so:
Put them togetherThe lexer that recognize and return tokens is:
The lexer object should greedily scan the entire input, ignore the whitespaces and comments, recognize the tokens and generate an array of the tokens.
Parser and InterpreterA Parser object can be a character level - which is fed a sequence of characters, or a token level - which is fed with an array of tokens. The lexer is a character level object, we need a token level Parser object to finish the second phase of the job. The token level Parser object should recognize different tokens and does the interpretation based on the grammar rules. Decimal Number ParserFrom the lexer, we could have two different kinds of tokens: decimal numbers and operators. To convert the decimal number token to a double value, we can use the helper class Terms:
Operator ParserThe lexer already recognizes the operators and converted them into special tokens. It is now time to apply grammar rules to these tokens. Using the Terms object created in the lexer section, we could easily get the Grammar objects for these operators. For example, the "+" operator can be obtained as: The above Term object only recognizes the "+" operator token, we also need to attach a semantic action to it(in our case, apply the plus operation). The same thought applies to all other operators. For binary operators, a Binary object is used to represent the semantic action: For unary operators, a Unary object can be used to represent semantic action: We can easily attach a Binary object to an operator as such: Similar code can be applied to all the other operators. To avoid code duplication, we can create a function: Thus, the plus operator becomes: Operator PrecedenceIn a classic recursive-desent parser, operators with different precedence are handled by defining different production rules, for example: This solution can lead to messy production rules when the number of operators and precedence levels scales up. It is more desirable to be able to specify the operator precedence declaratively. Left RecursionAnother drawback of recursive-desent parser is left recursion. In order to preserve the left-associative nature of the operators, the above production rules, are defined left-resursively (the first rule at the right hand side of muldiv_expr is muldiv_expr itself). Such left-recursive definition will lead to infinite-recursion in a naive recursive-desent parser implementation. Operator TableNParsec provides support for operator precedence grammar and automatically handles left recursion incured by left-associative operators. Our target syntax should look like: where all the operator precedence and associativity are specified declaratively.
Whitespace OperatorAt the beginning, we said we want to allow whitespace to be an alternative syntax for multiplication. Well, formally speaking, whitespace itself is not enough to make an operator. For example, "2 -3" should still be parsed as a "minus", not a multiplication between "2" and "-3". Therefore, our whitespace operator should only happen when none of "+", "-", "*", "/" is present. The Parser object that recognizes any of the four arithmetic operators is: And the parser that tells us "none of the four operators are present, this is a valid whitespace operator" is:
Semantic ActionThere are some preparation to do though. Firstly, the Infixl(...) method expects a Parser<Binary> object. Similarly, the Prefix(...)| method expects a Parser<Unary>, where the Unary object is responsible for implementing the semantic action of the unary operator. These Parser<Binary> and Parser<Unary> objects can be easily obtained from the getOperator() method we just defined. Now, with the OperatorTable object we created earlier, we can use the Expressions helper class to build a parser that handles the operators correctly: where the p_number is the parser that parses a decimal number and evaluates to a double value; the optable is the OperatorTable object that declares all the operator precedences and associativity. Executing Parser objectWith the lexer and the parser objects ready, we can execute them to calculate real expressions. First, we need to concatenate the lexer and the parser objects so that the parser object will take as input the output of the lexer object:
The parser object can then be used to actually parse/interpret expression:
Now if we run the above code, we will get -7 as the result. Mutual RecursionOur Parser object works except there's one thing missing: parenthesis. We wanted the parser to be able to parse sub-expression grouped by parenthesis. The production rule is:
Lazy EvaluationIt is impossible to represent such mutual-recursive production with declarations of local Parser variables. Lazy evaluation is necessary in this case:
Thus, the expression parser can be changed to:
And There We GoThe final parser code is as following: SummaryIn this tutorial, we chose to run the calculation directly in the semantic actions. We could defer this calculation and create intermediary abstract syntax tree for further analysis and processing (such as type checking and code optimisation). But that makes no real difference in the parser code. We just need to replace the direct calculation code with the code to create the abstract syntax tree. This short tutorial cannot cover every detail of the NParsec API. There are many other useful utility functions provided to help the process of constructing a sophisticated parser. You are welcome to shoot me a note to ajoo.email@gmail.com should you have any comment or question. |
|||||
|
Copyright 2003-2006 - The Codehaus. All rights reserved unless otherwise noted.
Powered by Atlassian Confluence
|
|||||