diff --git a/docs/sly.rst b/docs/sly.rst index 32b43a1..8d56f41 100644 --- a/docs/sly.rst +++ b/docs/sly.rst @@ -375,29 +375,6 @@ parser is often a hard problem. An error handler might scan ahead to a logical synchronization point such as a semicolon, a blank line, or similar landmark. -EOF Handling -^^^^^^^^^^^^ - -The lexer will produce tokens until it reaches the end of the supplied -input string. An optional ``eof()`` method can be used to handle an -end-of-file (EOF) condition in the input. For example:: - - class MyLexer(Lexer): - ... - # EOF handling rule - def eof(self): - # Get more input (Example) - more = input('more > ') - return more - -The ``eof()`` method should return a string as a result. Be aware -that reading input in chunks may require great attention to the -handling of chunk boundaries. Specifically, you can't break the text -such that a chunk boundary appears in the middle of a token (for -example, splitting input in the middle of a quoted string). For -this reason, you might have to do some additional framing -of the data such as splitting into lines or blocks to make it work. - Maintaining extra state ^^^^^^^^^^^^^^^^^^^^^^^ @@ -421,3 +398,1780 @@ example:: Please note that lexers already use the ``lineno`` and ``position`` attributes during parsing. +Writing a Parser +---------------- + +The ``Parser`` class is used to parse language syntax. Before showing +an example, there are a few important bits of background that must be +mentioned. First, *syntax* is usually specified in terms of a BNF +grammar. For example, if you wanted to parse simple arithmetic +expressions, you might first write an unambiguous grammar +specification like this:: + + expr : expr + term + | expr - term + | term + + term : term * factor + | term / factor + | factor + + factor : NUMBER + | ( expr ) + +In the grammar, symbols such as ``NUMBER``, ``+``, ``-``, ``*``, and +``/`` are known as *terminals* and correspond to raw input tokens. +Identifiers such as ``term`` and ``factor`` refer to grammar rules +comprised of a collection of terminals and other rules. These +identifiers are known as *non-terminals*. + +The semantic behavior of a language is often specified using a +technique known as syntax directed translation. In syntax directed +translation, attributes are attached to each symbol in a given grammar +rule along with an action. Whenever a particular grammar rule is +recognized, the action describes what to do. For example, given the +expression grammar above, you might write the specification for a +simple calculator like this:: + + Grammar Action + ------------------------ -------------------------------- + expr0 : expr1 + term expr0.val = expr1.val + term.val + | expr1 - term expr0.val = expr1.val - term.val + | term expr0.val = term.val + + term0 : term1 * factor term0.val = term1.val * factor.val + | term1 / factor term0.val = term1.val / factor.val + | factor term0.val = factor.val + + factor : NUMBER factor.val = int(NUMBER.val) + | ( expr ) factor.val = expr.val + +A good way to think about syntax directed translation is to view each +symbol in the grammar as a kind of object. Associated with each symbol +is a value representing its "state" (for example, the ``val`` +attribute above). Semantic actions are then expressed as a collection +of functions or methods that operate on the symbols and associated +values. + +SLY uses a parsing technique known as LR-parsing or shift-reduce +parsing. 
LR parsing is a bottom-up technique that tries to recognize the
+right-hand-side of various grammar rules. Whenever a valid
+right-hand-side is found in the input, the appropriate action code is
+triggered and the grammar symbols are replaced by the grammar symbol
+on the left-hand-side.
+
+LR parsing is commonly implemented by shifting grammar symbols onto a
+stack and looking at the stack and the next input token for patterns
+that match one of the grammar rules.  The details of the algorithm can
+be found in a compiler textbook, but the following example illustrates
+the steps that are performed if you wanted to parse the expression
+``3 + 5 * (10 - 20)`` using the grammar defined above.  In the example,
+the special symbol ``$`` represents the end of input::
+
+    Step Symbol Stack                   Input Tokens             Action
+    ---- ------------------------------ ------------------------ -----------------------------
+    1                                   3 + 5 * ( 10 - 20 )$     Shift 3
+    2    3                              + 5 * ( 10 - 20 )$       Reduce factor : NUMBER
+    3    factor                         + 5 * ( 10 - 20 )$       Reduce term : factor
+    4    term                           + 5 * ( 10 - 20 )$       Reduce expr : term
+    5    expr                           + 5 * ( 10 - 20 )$       Shift +
+    6    expr +                         5 * ( 10 - 20 )$         Shift 5
+    7    expr + 5                       * ( 10 - 20 )$           Reduce factor : NUMBER
+    8    expr + factor                  * ( 10 - 20 )$           Reduce term : factor
+    9    expr + term                    * ( 10 - 20 )$           Shift *
+    10   expr + term *                  ( 10 - 20 )$             Shift (
+    11   expr + term * (                10 - 20 )$               Shift 10
+    12   expr + term * ( 10             - 20 )$                  Reduce factor : NUMBER
+    13   expr + term * ( factor         - 20 )$                  Reduce term : factor
+    14   expr + term * ( term           - 20 )$                  Reduce expr : term
+    15   expr + term * ( expr           - 20 )$                  Shift -
+    16   expr + term * ( expr -         20 )$                    Shift 20
+    17   expr + term * ( expr - 20      )$                       Reduce factor : NUMBER
+    18   expr + term * ( expr - factor  )$                       Reduce term : factor
+    19   expr + term * ( expr - term    )$                       Reduce expr : expr - term
+    20   expr + term * ( expr           )$                       Shift )
+    21   expr + term * ( expr )         $                        Reduce factor : (expr)
+    22   expr + term * factor           $                        Reduce term : term * factor
+    23   expr + term                    $                        Reduce expr : expr + term
+    24   expr                           $                        Reduce expr
+    25                                  $                        Success!
+
+When parsing the expression, an underlying state machine and the
+current input token determine what happens next.  If the next token
+looks like part of a valid grammar rule (based on other items on the
+stack), it is generally shifted onto the stack.  If the top of the
+stack contains a valid right-hand-side of a grammar rule, it is
+usually "reduced" and the symbols replaced with the symbol on the
+left-hand-side.  When this reduction occurs, the appropriate action is
+triggered (if defined).  If the input token can't be shifted and the
+top of stack doesn't match any grammar rules, a syntax error has
+occurred and the parser must take some kind of recovery step (or bail
+out).  A parse is only successful if the parser reaches a state where
+the symbol stack is empty and there are no more input tokens.
+
+It is important to note that the underlying implementation is built
+around a large finite-state machine that is encoded in a collection of
+tables.  The construction of these tables is non-trivial and
+beyond the scope of this discussion.  However, subtle details of this
+process explain why, in the example above, the parser chooses to shift
+a token onto the stack in step 9 rather than reducing the
+rule ``expr : expr + term``.
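+
+Before moving on to SLY itself, it may help to see the syntax directed
+translation shown earlier carried out by hand.  The following sketch is a
+plain recursive-descent evaluator for the same grammar.  It is an
+illustration only; SLY instead builds the table-driven LR parser just
+described::
+
+    # Hand-written recursive-descent evaluator for the expr/term/factor
+    # grammar above.  Each function returns the "val" attribute of its
+    # grammar symbol, mirroring the syntax directed translation table.
+
+    def parse_expr(toks):
+        # expr : expr + term | expr - term | term
+        val = parse_term(toks)
+        while toks and toks[0] in ('+', '-'):
+            op = toks.pop(0)
+            right = parse_term(toks)
+            val = val + right if op == '+' else val - right
+        return val
+
+    def parse_term(toks):
+        # term : term * factor | term / factor | factor
+        val = parse_factor(toks)
+        while toks and toks[0] in ('*', '/'):
+            op = toks.pop(0)
+            right = parse_factor(toks)
+            val = val * right if op == '*' else val / right
+        return val
+
+    def parse_factor(toks):
+        # factor : NUMBER | ( expr )
+        tok = toks.pop(0)
+        if tok == '(':
+            val = parse_expr(toks)
+            assert toks.pop(0) == ')'    # Consume the closing parenthesis
+            return val
+        return int(tok)                  # NUMBER
+
+    # "3 + 5 * (10 - 20)" tokenized by hand:
+    print(parse_expr(['3', '+', '5', '*', '(', '10', '-', '20', ')']))  # -47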
+
+Parsing Example
+^^^^^^^^^^^^^^^
+
+Suppose you wanted to make a grammar for simple arithmetic expressions
+as previously described.  Here is how you would do it with SLY::
+
+    from sly import Parser
+    from calclex import CalcLexer
+
+    class CalcParser(Parser):
+        # Get the token list from the lexer (required)
+        tokens = CalcLexer.tokens
+
+        # Grammar rules and actions
+        @_('expr PLUS term')
+        def expr(self, p):
+            p[0] = p[1] + p[3]
+
+        @_('expr MINUS term')
+        def expr(self, p):
+            p[0] = p[1] - p[3]
+
+        @_('term')
+        def expr(self, p):
+            p[0] = p[1]
+
+        @_('term TIMES factor')
+        def term(self, p):
+            p[0] = p[1] * p[3]
+
+        @_('term DIVIDE factor')
+        def term(self, p):
+            p[0] = p[1] / p[3]
+
+        @_('factor')
+        def term(self, p):
+            p[0] = p[1]
+
+        @_('NUMBER')
+        def factor(self, p):
+            p[0] = p[1]
+
+        @_('LPAREN expr RPAREN')
+        def factor(self, p):
+            p[0] = p[2]
+
+        # Error rule for syntax errors
+        def error(self, p):
+            print("Syntax error in input!")
+
+    if __name__ == '__main__':
+        lexer = CalcLexer()
+        parser = CalcParser()
+
+        while True:
+            try:
+                text = input('calc > ')
+                result = parser.parse(lexer.tokenize(text))
+                print(result)
+            except EOFError:
+                break
+
+In this example, each grammar rule is defined by a method that has been
+decorated with ``@_(rule)`` where ``rule`` contains the appropriate
+context-free grammar specification.  The statements that make up the
+method body implement the semantic actions of the rule.  Each method
+accepts a single argument ``p`` that is a sequence containing the
+values of each grammar symbol in the corresponding rule.  The values
+of ``p[i]`` are mapped to grammar symbols as shown here::
+
+    #    p[1] p[2] p[3]
+    #     |    |    |
+    @_('expr PLUS term')
+    def expr(self, p):
+        p[0] = p[1] + p[3]
+
+For tokens, the "value" of the corresponding ``p[i]`` is the
+*same* as the ``value`` attribute assigned to tokens in the lexer
+module.  For non-terminals, the value is determined by whatever is
+placed in ``p[0]`` when rules are reduced.  This value can be
+anything at all.  However, it is probably most common for the value to
+be a simple Python type, a tuple, or an instance.  In this example, we
+are relying on the fact that the ``NUMBER`` token stores an
+integer value in its value field.  All of the other rules simply
+perform various types of integer operations and propagate the result.
+
+Note: the use of negative indices has a special meaning---specifically,
+``p[-1]`` does not have the same value as ``p[3]`` in this example.
+Please see the section on "Embedded Actions" for further details.
+
+The first rule defined in the parser determines the starting grammar
+symbol (in this case, a rule for ``expr`` appears first).  Whenever
+the starting rule is reduced by the parser and no more input is
+available, parsing stops and the final value is returned (this value
+will be whatever the top-most rule placed in ``p[0]``).  Note: an
+alternative starting symbol can be specified using the ``start``
+attribute in the class.
+
+The ``error()`` method is defined to catch syntax errors.
+See the error handling section below for more detail.
+
+If any errors are detected in your grammar specification, SLY will
+produce diagnostic messages and possibly raise an exception.  Some of
+the errors that can be detected include:
+
+- Duplicated grammar rules
+- Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars
+- Badly specified grammar rules
+- Infinite recursion (rules that can never terminate)
+- Unused rules and tokens
+- Undefined rules and tokens
+
+The final part of the example shows how to actually run the parser.
+To run the parser, you simply have to call the ``parse()`` method with
+a sequence of the input tokens.  This will run all of the grammar
+rules and return the result of the entire parse.  The result is the
+value assigned to ``p[0]`` in the starting grammar rule.
+
+Combining Grammar Rule Functions
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When grammar rules are similar, they can be combined into a single function.
+For example, consider the two rules in our earlier example::
+
+    @_('expr PLUS term')
+    def expr(self, p):
+        p[0] = p[1] + p[3]
+
+    @_('expr MINUS term')
+    def expr(self, p):
+        p[0] = p[1] - p[3]
+
+Instead of writing two functions, you might write a single function like
+this::
+
+    @_('expr PLUS term',
+       'expr MINUS term')
+    def expr(self, p):
+        if p[2] == '+':
+            p[0] = p[1] + p[3]
+        elif p[2] == '-':
+            p[0] = p[1] - p[3]
+
+In general, the ``@_()`` decorator for any given method can list
+multiple grammar rules.  When combining grammar rules into a single
+function though, it is usually a good idea for all of the rules to have
+a similar structure (e.g., the same number of terms).  Otherwise, the
+corresponding action code may be more complicated than necessary.
+However, it is possible to handle simple cases using ``len()``.  For
+example::
+
+    @_('expr MINUS expr',
+       'MINUS expr')
+    def expr(self, p):
+        if len(p) == 4:
+            p[0] = p[1] - p[3]
+        elif len(p) == 3:
+            p[0] = -p[2]
+
+If parsing performance is a concern, you should resist the urge to put
+too much conditional processing into a single grammar rule as shown in
+these examples.  When you add checks to see which grammar rule is
+being handled, you are actually duplicating the work that the parser
+has already performed (i.e., the parser already knows exactly what rule
+it matched).  You can eliminate this overhead by using a
+separate method for each grammar rule.
+
+Character Literals
+^^^^^^^^^^^^^^^^^^
+
+If desired, a grammar may contain tokens defined as single character
+literals.  For example::
+
+    @_('expr "+" term')
+    def expr(self, p):
+        p[0] = p[1] + p[3]
+
+    @_('expr "-" term')
+    def expr(self, p):
+        p[0] = p[1] - p[3]
+
+A character literal must be enclosed in quotes such as ``"+"``.  In
+addition, if literals are used, they must be declared in the
+corresponding lexer class through the use of a special ``literals``
+declaration::
+
+    class CalcLexer(Lexer):
+        ...
+        literals = ['+', '-', '*', '/']
+        ...
+
+Character literals are limited to a single character.  Thus, it is not
+legal to specify literals such as ``<=`` or ``==``.
+For this, use the normal lexing rules (e.g., define a rule such as
+``EQ = r'=='``).
+
+Empty Productions
+^^^^^^^^^^^^^^^^^
+
+If you need an empty production, define a special rule like this::
+
+    @_('')
+    def empty(self, p):
+        pass
+
+Now to use the empty production, simply use ``empty`` as a symbol.  For
+example::
+
+    @_('item')
+    def optitem(self, p):
+        ...
+
+    @_('empty')
+    def optitem(self, p):
+        ...
+
+Note: You can write empty rules anywhere by simply specifying an empty
+string.  However, I personally find that writing an ``empty``
+rule and using ``empty`` to denote an empty production is easier to read
+and more clearly states your intentions.
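+
+For instance, an ``optitem``-style rule can give a larger rule an
+optional piece.  Here is a sketch of a hypothetical function-call rule
+with an optional argument list (the ``funccall``, ``optargs``, and
+``args`` names are made up for illustration and are not part of the
+calculator example)::
+
+    @_('ID LPAREN optargs RPAREN')
+    def funccall(self, p):
+        p[0] = ('call', p[1], p[3])
+
+    @_('args')
+    def optargs(self, p):
+        p[0] = p[1]
+
+    @_('empty')
+    def optargs(self, p):
+        p[0] = []              # No arguments supplied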
+
+Changing the starting symbol
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Normally, the first rule found in the parser class defines the
+starting grammar rule (top-level rule).  To change this, supply
+a ``start`` specifier in your class.  For example::
+
+    class CalcParser(Parser):
+        start = 'foo'
+
+        @_('A B')
+        def bar(self, p):
+            ...
+
+        @_('bar X')
+        def foo(self, p):     # Parsing starts here (start symbol above)
+            ...
+
+The use of a ``start`` specifier may be useful during debugging
+since you can use it to work with a subset of a larger grammar.
+
+Dealing With Ambiguous Grammars
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The expression grammar given in the earlier example has been written
+in a special format to eliminate ambiguity.  However, in many
+situations, it is extremely difficult or awkward to write grammars in
+this format.  A much more natural way to express the grammar is in a
+more compact form like this::
+
+    expr : expr PLUS expr
+         | expr MINUS expr
+         | expr TIMES expr
+         | expr DIVIDE expr
+         | LPAREN expr RPAREN
+         | NUMBER
+
+Unfortunately, this grammar specification is ambiguous.  For example,
+if you are parsing the string "3 * 4 + 5", there is no way to tell how
+the operators are supposed to be grouped.  Does the expression mean
+"(3 * 4) + 5" or "3 * (4 + 5)"?
+
+When an ambiguous grammar is given, you will get messages about
+"shift/reduce conflicts" or "reduce/reduce conflicts".  A shift/reduce
+conflict is caused when the parser generator can't decide whether to
+reduce a rule or shift a symbol on the parsing stack.  For example,
+consider the string "3 * 4 + 5" and the internal parsing stack::
+
+    Step Symbol Stack           Input Tokens            Action
+    ---- ---------------------  ---------------------   -------------------------------
+    1    $                      3 * 4 + 5$              Shift 3
+    2    $ 3                    * 4 + 5$                Reduce expr : NUMBER
+    3    $ expr                 * 4 + 5$                Shift *
+    4    $ expr *               4 + 5$                  Shift 4
+    5    $ expr * 4             + 5$                    Reduce expr : NUMBER
+    6    $ expr * expr          + 5$                    SHIFT/REDUCE CONFLICT ????
+
+In this case, when the parser reaches step 6, it has two options.  One
+is to reduce the rule ``expr : expr * expr`` on the stack.  The other
+option is to shift the token ``+`` onto the stack.  Both options are
+perfectly legal from the rules of the context-free grammar.
+
+By default, all shift/reduce conflicts are resolved in favor of
+shifting.  Therefore, in the above example, the parser will always
+shift the ``+`` instead of reducing.  Although this strategy
+works in many cases (for example, the case of
+"if-then" versus "if-then-else"), it is not enough for arithmetic
+expressions.  In fact, in the above example, the decision to shift
+``+`` is completely wrong---we should have reduced ``expr * expr``
+since multiplication has higher mathematical precedence than addition.
+
+To resolve ambiguity, especially in expression
+grammars, SLY allows individual tokens to be assigned a
+precedence level and associativity.  This is done by adding a
+``precedence`` variable to the parser class like this::
+
+    class CalcParser(Parser):
+        ...
+        precedence = (
+            ('left', 'PLUS', 'MINUS'),
+            ('left', 'TIMES', 'DIVIDE'),
+        )
+        ...
+
+This declaration specifies that ``PLUS``/``MINUS`` have the
+same precedence level and are left-associative and that
+``TIMES``/``DIVIDE`` have the same precedence and are
+left-associative.  Within the ``precedence`` declaration, tokens
+are ordered from lowest to highest precedence.  Thus, this declaration
+specifies that ``TIMES``/``DIVIDE`` have higher precedence
+than ``PLUS``/``MINUS`` (since they appear later in the
+precedence specification).
+
+The precedence specification works by associating a numerical
+precedence level value and associativity direction to the listed
+tokens.
In the example above, you get::
+
+    PLUS      : level = 1,  assoc = 'left'
+    MINUS     : level = 1,  assoc = 'left'
+    TIMES     : level = 2,  assoc = 'left'
+    DIVIDE    : level = 2,  assoc = 'left'
+
+These values are then used to attach a numerical precedence value and
+associativity direction to each grammar rule.  *This is always
+determined by looking at the precedence of the right-most terminal
+symbol.*  For example::
+
+    expr : expr PLUS expr           # level = 1, left
+         | expr MINUS expr          # level = 1, left
+         | expr TIMES expr          # level = 2, left
+         | expr DIVIDE expr         # level = 2, left
+         | LPAREN expr RPAREN       # level = None (not specified)
+         | NUMBER                   # level = None (not specified)
+
+When shift/reduce conflicts are encountered, the parser generator
+resolves the conflict by looking at the precedence rules and
+associativity specifiers:
+
+1. If the current token has higher precedence than the rule on the
+   stack, it is shifted.
+
+2. If the grammar rule on the stack has higher precedence, the rule
+   is reduced.
+
+3. If the current token and the grammar rule have the same precedence,
+   the rule is reduced for left associativity, whereas the token is
+   shifted for right associativity.
+
+4. If nothing is known about the precedence, shift/reduce conflicts
+   are resolved in favor of shifting (the default).
+
+For example, if "expr PLUS expr" has been parsed and the
+next token is "TIMES", the action is going to be a shift because
+"TIMES" has a higher precedence level than "PLUS".  On the other hand,
+if "expr TIMES expr" has been parsed and the next token is
+"PLUS", the action is going to be reduce because "PLUS" has a lower
+precedence than "TIMES."
+
+When shift/reduce conflicts are resolved using the first three
+techniques (with the help of precedence rules), SLY will
+report no errors or conflicts in the grammar.
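+
+To see how the pieces fit together, here is a sketch of the calculator
+parser written against the compact, ambiguous grammar, relying entirely
+on ``precedence`` to resolve the conflicts.  It assumes the
+``CalcLexer`` from earlier defines the listed tokens and that operator
+tokens carry their matched text (``'+'``, ``'-'``, etc.) as their
+values::
+
+    from sly import Parser
+    from calclex import CalcLexer
+
+    class CalcParser(Parser):
+        tokens = CalcLexer.tokens
+
+        precedence = (
+            ('left', 'PLUS', 'MINUS'),
+            ('left', 'TIMES', 'DIVIDE'),
+        )
+
+        @_('expr PLUS expr',
+           'expr MINUS expr',
+           'expr TIMES expr',
+           'expr DIVIDE expr')
+        def expr(self, p):
+            # p[2] is the operator token's value (its matched text)
+            if p[2] == '+':
+                p[0] = p[1] + p[3]
+            elif p[2] == '-':
+                p[0] = p[1] - p[3]
+            elif p[2] == '*':
+                p[0] = p[1] * p[3]
+            else:
+                p[0] = p[1] / p[3]
+
+        @_('LPAREN expr RPAREN')
+        def expr(self, p):
+            p[0] = p[2]
+
+        @_('NUMBER')
+        def expr(self, p):
+            p[0] = p[1]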
+
+One problem with the precedence specifier technique is that it is
+sometimes necessary to change the precedence of an operator in certain
+contexts.  For example, consider a unary-minus operator in
+"3 + 4 * -5".  Mathematically, the unary minus is normally given a very
+high precedence--being evaluated before the multiply.  However, in our
+precedence specifier, MINUS has a lower precedence than TIMES.  To
+deal with this, precedence rules can be given for so-called "fictitious
+tokens" like this::
+
+    class CalcParser(Parser):
+        ...
+        precedence = (
+            ('left', 'PLUS', 'MINUS'),
+            ('left', 'TIMES', 'DIVIDE'),
+            ('right', 'UMINUS'),            # Unary minus operator
+        )
+
+Now, in the grammar file, we can write our unary minus rule like this::
+
+    @_('MINUS expr %prec UMINUS')
+    def expr(self, p):
+        p[0] = -p[2]
+
+In this case, ``%prec UMINUS`` overrides the default rule
+precedence--setting it to that of ``UMINUS`` in the precedence
+specifier.
+
+At first, the use of ``UMINUS`` in this example may appear very
+confusing.  ``UMINUS`` is not an input token or a grammar rule.
+Instead, you should think of it as the name of a special marker in the
+precedence table.  When you use the ``%prec`` qualifier, you're simply
+telling SLY that you want the precedence of the expression to be the
+same as for this special marker instead of the usual precedence.
+
+It is also possible to specify non-associativity in the ``precedence``
+table.  This would be used when you *don't* want operations to chain
+together.  For example, suppose you wanted to support comparison
+operators like ``<`` and ``>`` but you didn't want to allow
+combinations like ``a < b < c``.  To do this, specify a rule like
+this::
+
+    class MyParser(Parser):
+        ...
+        precedence = (
+            ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
+            ('left', 'PLUS', 'MINUS'),
+            ('left', 'TIMES', 'DIVIDE'),
+            ('right', 'UMINUS'),            # Unary minus operator
+        )
+
+If you do this, the occurrence of input text such as ``a < b < c``
+will result in a syntax error.  However, simple expressions such as
+``a < b`` will still be fine.
+
+Reduce/reduce conflicts are caused when there are multiple grammar
+rules that can be applied to a given set of symbols.  This kind of
+conflict is almost always bad and is always resolved by picking the
+rule that appears first in the grammar file.  Reduce/reduce conflicts
+typically arise when different sets of grammar rules somehow
+generate the same set of symbols.  For example::
+
+    assignment : ID EQUALS NUMBER
+               | ID EQUALS expr
+
+    expr : expr PLUS expr
+         | expr MINUS expr
+         | expr TIMES expr
+         | expr DIVIDE expr
+         | LPAREN expr RPAREN
+         | NUMBER
+
+In this case, a reduce/reduce conflict exists between these two rules::
+
+    assignment : ID EQUALS NUMBER
+    expr : NUMBER
+
+For example, if you wrote "a = 5", the parser can't figure out if this
+is supposed to be reduced as ``assignment : ID EQUALS NUMBER`` or
+whether it's supposed to reduce the 5 as an expression and then reduce
+the rule ``assignment : ID EQUALS expr``.
+
+It should be noted that reduce/reduce conflicts are notoriously
+difficult to spot simply by looking at the input grammar.  When a
+reduce/reduce conflict occurs, SLY will try to help by
+printing a warning message such as this::
+
+    WARNING: 1 reduce/reduce conflict
+    WARNING: reduce/reduce conflict in state 15 resolved using rule (assignment -> ID EQUALS NUMBER)
+    WARNING: rejected rule (expr -> NUMBER)
+
+This message identifies the two rules that are in conflict.  However,
+it may not tell you how the parser arrived at such a state.  To try
+and figure it out, you'll probably have to look at your grammar and
+the contents of the parser debugging file with an appropriately high
+level of caffeination (see the next section).
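+
+The usual fix for a reduce/reduce conflict is to restructure the grammar
+so that only one rule can apply.  In the example above, you could drop
+the ``assignment : ID EQUALS NUMBER`` case entirely and let
+``expr : NUMBER`` absorb constants.  A sketch (the tuple value is just
+one hypothetical choice of action)::
+
+    # Only one assignment rule: constants now reach it through 'expr'
+    @_('ID EQUALS expr')
+    def assignment(self, p):
+        p[0] = ('assign', p[1], p[3])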
+
+Parser Debugging
+^^^^^^^^^^^^^^^^
+
+Tracking down shift/reduce and reduce/reduce conflicts is one of the
+finer pleasures of using an LR parsing algorithm.  To assist in
+debugging, SLY creates a debugging file called ``parser.out`` when it
+generates the parsing table.  The contents of this file look like the
+following::
+
+    Unused terminals:
+
+    Grammar
+
+    Rule 1     expression -> expression PLUS expression
+    Rule 2     expression -> expression MINUS expression
+    Rule 3     expression -> expression TIMES expression
+    Rule 4     expression -> expression DIVIDE expression
+    Rule 5     expression -> NUMBER
+    Rule 6     expression -> LPAREN expression RPAREN
+
+    Terminals, with rules where they appear
+
+    TIMES                : 3
+    error                :
+    MINUS                : 2
+    RPAREN               : 6
+    LPAREN               : 6
+    DIVIDE               : 4
+    PLUS                 : 1
+    NUMBER               : 5
+
+    Nonterminals, with rules where they appear
+
+    expression           : 1 1 2 2 3 3 4 4 6 0
+
+    Parsing method: LALR
+
+    state 0
+
+        S' -> . expression
+        expression -> . expression PLUS expression
+        expression -> . expression MINUS expression
+        expression -> . expression TIMES expression
+        expression -> . expression DIVIDE expression
+        expression -> . NUMBER
+        expression -> . LPAREN expression RPAREN
+
+        NUMBER          shift and go to state 3
+        LPAREN          shift and go to state 2
+
+    state 1
+
+        S' -> expression .
+        expression -> expression . PLUS expression
+        expression -> expression . MINUS expression
+        expression -> expression . TIMES expression
+        expression -> expression . DIVIDE expression
+
+        PLUS            shift and go to state 6
+        MINUS           shift and go to state 5
+        TIMES           shift and go to state 4
+        DIVIDE          shift and go to state 7
+
+    state 2
+
+        expression -> LPAREN . expression RPAREN
+        expression -> . expression PLUS expression
+        expression -> . expression MINUS expression
+        expression -> . expression TIMES expression
+        expression -> . expression DIVIDE expression
+        expression -> . NUMBER
+        expression -> . LPAREN expression RPAREN
+
+        NUMBER          shift and go to state 3
+        LPAREN          shift and go to state 2
+
+    state 3
+
+        expression -> NUMBER .
+
+        $               reduce using rule 5
+        PLUS            reduce using rule 5
+        MINUS           reduce using rule 5
+        TIMES           reduce using rule 5
+        DIVIDE          reduce using rule 5
+        RPAREN          reduce using rule 5
+
+    state 4
+
+        expression -> expression TIMES . expression
+        expression -> . expression PLUS expression
+        expression -> . expression MINUS expression
+        expression -> . expression TIMES expression
+        expression -> . expression DIVIDE expression
+        expression -> . NUMBER
+        expression -> . LPAREN expression RPAREN
+
+        NUMBER          shift and go to state 3
+        LPAREN          shift and go to state 2
+
+    state 5
+
+        expression -> expression MINUS . expression
+        expression -> . expression PLUS expression
+        expression -> . expression MINUS expression
+        expression -> . expression TIMES expression
+        expression -> . expression DIVIDE expression
+        expression -> . NUMBER
+        expression -> . LPAREN expression RPAREN
+
+        NUMBER          shift and go to state 3
+        LPAREN          shift and go to state 2
+
+    state 6
+
+        expression -> expression PLUS . expression
+        expression -> . expression PLUS expression
+        expression -> . expression MINUS expression
+        expression -> . expression TIMES expression
+        expression -> . expression DIVIDE expression
+        expression -> . NUMBER
+        expression -> . LPAREN expression RPAREN
+
+        NUMBER          shift and go to state 3
+        LPAREN          shift and go to state 2
+
+    state 7
+
+        expression -> expression DIVIDE . expression
+        expression -> . expression PLUS expression
+        expression -> . expression MINUS expression
+        expression -> . expression TIMES expression
+        expression -> . expression DIVIDE expression
+        expression -> . NUMBER
+        expression -> . LPAREN expression RPAREN
+
+        NUMBER          shift and go to state 3
+        LPAREN          shift and go to state 2
+
+    state 8
+
+        expression -> LPAREN expression . RPAREN
+        expression -> expression . PLUS expression
+        expression -> expression . MINUS expression
+        expression -> expression . TIMES expression
+        expression -> expression . DIVIDE expression
+
+        RPAREN          shift and go to state 13
+        PLUS            shift and go to state 6
+        MINUS           shift and go to state 5
+        TIMES           shift and go to state 4
+        DIVIDE          shift and go to state 7
+
+    state 9
+
+        expression -> expression TIMES expression .
+        expression -> expression . PLUS expression
+        expression -> expression . MINUS expression
+        expression -> expression . TIMES expression
+        expression -> expression . DIVIDE expression
+
+        $               reduce using rule 3
+        PLUS            reduce using rule 3
+        MINUS           reduce using rule 3
+        TIMES           reduce using rule 3
+        DIVIDE          reduce using rule 3
+        RPAREN          reduce using rule 3
+
+        ! PLUS          [ shift and go to state 6 ]
+        ! MINUS         [ shift and go to state 5 ]
+        ! TIMES         [ shift and go to state 4 ]
+        ! DIVIDE        [ shift and go to state 7 ]
+
+    state 10
+
+        expression -> expression MINUS expression .
+        expression -> expression . PLUS expression
+        expression -> expression . MINUS expression
+        expression -> expression . TIMES expression
+        expression -> expression . DIVIDE expression
+
+        $               reduce using rule 2
+        PLUS            reduce using rule 2
+        MINUS           reduce using rule 2
+        RPAREN          reduce using rule 2
+        TIMES           shift and go to state 4
+        DIVIDE          shift and go to state 7
+
+        ! TIMES         [ reduce using rule 2 ]
+        ! DIVIDE        [ reduce using rule 2 ]
+        ! PLUS          [ shift and go to state 6 ]
+        ! MINUS         [ shift and go to state 5 ]
+
+    state 11
+
+        expression -> expression PLUS expression .
+        expression -> expression . PLUS expression
+        expression -> expression . MINUS expression
+        expression -> expression . TIMES expression
+        expression -> expression . DIVIDE expression
+
+        $               reduce using rule 1
+        PLUS            reduce using rule 1
+        MINUS           reduce using rule 1
+        RPAREN          reduce using rule 1
+        TIMES           shift and go to state 4
+        DIVIDE          shift and go to state 7
+
+        ! TIMES         [ reduce using rule 1 ]
+        ! DIVIDE        [ reduce using rule 1 ]
+        ! PLUS          [ shift and go to state 6 ]
+        ! MINUS         [ shift and go to state 5 ]
+
+    state 12
+
+        expression -> expression DIVIDE expression .
+        expression -> expression . PLUS expression
+        expression -> expression . MINUS expression
+        expression -> expression . TIMES expression
+        expression -> expression . DIVIDE expression
+
+        $               reduce using rule 4
+        PLUS            reduce using rule 4
+        MINUS           reduce using rule 4
+        TIMES           reduce using rule 4
+        DIVIDE          reduce using rule 4
+        RPAREN          reduce using rule 4
+
+        ! PLUS          [ shift and go to state 6 ]
+        ! MINUS         [ shift and go to state 5 ]
+        ! TIMES         [ shift and go to state 4 ]
+        ! DIVIDE        [ shift and go to state 7 ]
+
+    state 13
+
+        expression -> LPAREN expression RPAREN .
+
+        $               reduce using rule 6
+        PLUS            reduce using rule 6
+        MINUS           reduce using rule 6
+        TIMES           reduce using rule 6
+        DIVIDE          reduce using rule 6
+        RPAREN          reduce using rule 6
+
+The different states that appear in this file are a representation of
+every possible sequence of valid input tokens allowed by the grammar.
+When receiving input tokens, the parser is building up a stack and
+looking for matching rules.  Each state keeps track of the grammar
+rules that might be in the process of being matched at that point.
+Within each rule, the "." character indicates the current location of
+the parse within that rule.  In addition, the actions for each valid
+input token are listed.  When a shift/reduce or reduce/reduce conflict
+arises, rules *not* selected are prefixed with an ``!``.  For example::
+
+    ! TIMES         [ reduce using rule 2 ]
+    ! DIVIDE        [ reduce using rule 2 ]
+    ! PLUS          [ shift and go to state 6 ]
+    ! MINUS         [ shift and go to state 5 ]
+
+By looking at these rules (and with a little practice), you can usually
+track down the source of most parsing conflicts.  It should also be
+stressed that not all shift-reduce conflicts are bad.  However, the
+only way to be sure that they are resolved correctly is to look at
+``parser.out``.
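+
+Depending on the SLY version, the debugging file may need to be
+requested explicitly by setting a ``debugfile`` attribute on the parser
+class (assuming your version supports it); the file is then written
+when the parser tables are built::
+
+    class CalcParser(Parser):
+        tokens = CalcLexer.tokens
+        debugfile = 'parser.out'   # Write the grammar/state listing shown above
+        ...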
+
+Syntax Error Handling
+^^^^^^^^^^^^^^^^^^^^^
+
+When a syntax error occurs, SLY performs the following steps:
+
+1. On the first occurrence of a syntax error, the user-defined
+   ``error()`` method is called with the offending token as an
+   argument.  However, if the syntax error is due to reaching the
+   end-of-file, ``error()`` is called with an argument of ``None``.
+   Afterwards, the parser enters an "error-recovery" mode in which it
+   will not make future calls to ``error()`` until it has successfully
+   shifted at least 3 tokens onto the parsing stack.
+
+2. If no recovery action is taken in ``error()``, the offending
+   lookahead token is replaced with a special ``error`` token.
+
+3. If the offending lookahead token is already set to ``error``, the
+   top item of the parsing stack is deleted.
+
+4. If the entire parsing stack is unwound, the parser enters a restart
+   state and attempts to start parsing from its initial state.
+
+5. If a grammar rule accepts ``error`` as a token, it will be matched
+   with any sequence of tokens that might otherwise be rejected by the
+   parser.
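+
+As a baseline, a minimal ``error()`` method that merely reports the
+problem and lets the recovery procedure above run its course might look
+like this (a sketch; it assumes tokens carry the ``lineno`` attribute
+set by the lexer)::
+
+    class CalcParser(Parser):
+        tokens = CalcLexer.tokens
+        ...
+        def error(self, p):
+            if p:
+                print(f"Line {p.lineno}: syntax error at {p.type!r}")
+            else:
+                print("Syntax error at end of input")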
+
+The most well-behaved approach for handling syntax errors is to write
+grammar rules that include the ``error`` token.  For example, suppose
+your language had a grammar rule for a print statement like this::
+
+    @_('PRINT expr SEMI')
+    def statement(self, p):
+        ...
+
+To account for the possibility of a bad expression, you might write an
+additional grammar rule like this::
+
+    @_('PRINT error SEMI')
+    def statement(self, p):
+        print("Syntax error in print statement. Bad expression")
+
+In this case, the ``error`` token will match any sequence of
+tokens that might appear up to the first semicolon that is
+encountered.  Once the semicolon is reached, the rule will be
+invoked and the ``error`` token will go away.
+
+This type of recovery is sometimes known as parser resynchronization.
+The ``error`` token acts as a wildcard for any bad input text and
+the token immediately following ``error`` acts as a
+synchronization token.
+
+It is important to note that the ``error`` token usually does not
+appear as the last token on the right in an error rule.  For example::
+
+    @_('PRINT error')
+    def statement(self, p):
+        print("Syntax error in print statement. Bad expression")
+
+This is because the first bad token encountered will cause the rule to
+be reduced--which may make it difficult to recover if more bad tokens
+immediately follow.
+
+Panic mode recovery is implemented entirely in the ``error()`` method.
+For example, this method starts discarding tokens until it reaches a
+closing '}'.  Then, it restarts the parser in its initial state::
+
+    def error(self, p):
+        print("Whoa. You are seriously hosed.")
+        if not p:
+            print("End of File!")
+            return
+
+        # Read ahead looking for a closing '}'
+        while True:
+            tok = next(self.tokens, None)      # Get the next token
+            if not tok or tok.type == 'RBRACE':
+                break
+        self.restart()
+
+This variation simply discards the bad token and tells the parser that
+the error was OK::
+
+    def error(self, p):
+        if p:
+            print("Syntax error at token", p.type)
+            # Just discard the token and tell the parser it's okay.
+            self.errok()
+        else:
+            print("Syntax error at EOF")
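+
+If you adopt this discard-and-continue style, you will probably also
+want to cap the number of reported errors so that a badly mangled input
+doesn't produce an avalanche of messages.  One hypothetical way to do
+it (the ``error_count`` attribute is made up, not part of SLY)::
+
+    class CalcParser(Parser):
+        tokens = CalcLexer.tokens
+        ...
+        def error(self, p):
+            # Count errors on the parser instance itself
+            self.error_count = getattr(self, 'error_count', 0) + 1
+            if self.error_count > 10:
+                raise RuntimeError('Too many syntax errors. Giving up.')
+            if p:
+                print('Syntax error at token', p.type)
+                self.errok()
+            else:
+                print('Syntax error at EOF')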
+
+More information on these operations is as follows:
+
+- ``self.errok()``.  This resets the parser state so it doesn't think
+  it's in error-recovery mode.  This will prevent an ``error`` token
+  from being generated and will reset the internal error counters so
+  that the next syntax error will call ``error()`` again.
+
+- ``next(self.tokens, None)``.  This returns the next token on the
+  input stream, or ``None`` if there is no more input.
+
+- ``self.restart()``.  This discards the entire parsing stack and
+  resets the parser to its initial state.
+
+To supply the next lookahead token to the parser, ``error()`` can
+return a token.  This might be useful if trying to synchronize on
+special characters.  For example::
+
+    def error(self, p):
+        # Read ahead looking for a terminating ";"
+        while True:
+            tok = next(self.tokens, None)      # Get the next token
+            if not tok or tok.type == 'SEMI':
+                break
+        self.errok()
+
+        # Return SEMI to the parser as the next lookahead token
+        return tok
+
+Keep in mind that in the above error handling methods, the parser is
+available as ``self``, since ``error()`` is defined directly on the
+parser class.  There is no need to stash a separate reference to the
+parser instance for use during error handling.
+
+If necessary, a production rule can manually force the parser to enter
+error recovery.  This is done by raising ``SyntaxError``::
+
+    @_('some production ...')
+    def production(self, p):
+        raise SyntaxError
+
+The effect of raising ``SyntaxError`` is the same as if the last symbol
+shifted onto the parsing stack was actually a syntax error.  Thus, when
+you do this, the last symbol shifted is popped off of the parsing stack
+and the current lookahead token is set to an ``error`` token.  The
+parser then enters error-recovery mode where it tries to reduce rules
+that can accept ``error`` tokens.  The steps that follow from this
+point are exactly the same as if a syntax error were detected and
+``error()`` were called.
+
+One important aspect of manually setting an error is that the
+``error()`` method will NOT be called in this case.  If you need to
+issue an error message, make sure you do it in the production that
+raises ``SyntaxError``.
+
+Note: this feature is meant to mimic the behavior of the YYERROR macro
+in yacc.
+
+When Do Syntax Errors Get Reported
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In most cases, the parser will handle errors as soon as a bad input
+token is detected on the input.  However, be aware that the parser may
+choose to delay error handling until after it has reduced one or more
+grammar rules first.  This behavior might be unexpected, but it's
+related to special states in the underlying parsing table known as
+"defaulted states."  A defaulted state is a parsing condition where the
+same grammar rule will be reduced regardless of what valid token
+comes next on the input.  For such states, the parser chooses to go
+ahead and reduce the grammar rule without reading the next input
+token.  If the next token is bad, the parser will eventually get around
+to reading it and will report a syntax error.  It's just a little
+unusual in that you might see some of your grammar rules firing
+immediately prior to the syntax error.
+
+Usually, the delayed error reporting with defaulted states is harmless
+(and there are other reasons for wanting the parser to behave in this
+way).  However, if you need to turn this behavior off for some reason,
+you can clear the defaulted-states table like this::
+
+    parser = yacc.yacc()
+    parser.defaulted_states = {}
+
+Disabling defaulted states is not recommended if your grammar makes use
+of embedded actions as described in the "Embedded Actions" section
+below.
+
+Line Number and Position Tracking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As an optional feature, ``yacc.py`` can automatically track line
+numbers and positions for all of the grammar symbols as well.
+However, this extra tracking requires extra processing and can
+significantly slow down parsing.  Therefore, it must be enabled by
+passing the ``tracking=True`` option to ``yacc.parse()``.  For
+example::
+
+    yacc.parse(data, tracking=True)
+
+Once enabled, the ``lineno()`` and ``lexpos()`` methods work
+for all grammar symbols.  For example::
+
+    def p_expression(p):
+        'expression : expression PLUS expression'
+        line = p.lineno(2)        # line number of the PLUS token
+        index = p.lexpos(2)       # Position of the PLUS token
+
+In addition, two additional methods can be used:
+
+- ``p.linespan(num)``.  Return a tuple ``(startline, endline)`` with
+  the starting and ending line number for symbol *num*.
+
+- ``p.lexspan(num)``.  Return a tuple ``(start, end)`` with the
+  starting and ending positions for symbol *num*.
+
+For example::
+
+    def p_expression(p):
+        'expression : expression PLUS expression'
+        p.lineno(1)        # Line number of the left expression
+        p.lineno(2)        # line number of the PLUS operator
+        p.lineno(3)        # line number of the right expression
+        ...
+        start, end = p.linespan(3)    # Start,end lines of the right expression
+        starti, endi = p.lexspan(3)   # Start,end positions of right expression
+
+Note: The ``lexspan()`` function only returns the range of values up to
+the start of the last grammar symbol.
+
+Although it may be convenient for PLY to track position information on
+all grammar symbols, this is often unnecessary.  For example, if you
+are merely using line number information in an error message, you can
+often just key off of a specific token in the grammar rule.  For
+example::
+
+    def p_bad_func(p):
+        'funccall : fname LPAREN error RPAREN'
+        # Line number reported from LPAREN token
+        print("Bad function call at line", p.lineno(2))
+
+Similarly, you may get better parsing performance if you only
+selectively propagate line number information where it's needed using
+the ``p.set_lineno()`` method.  For example::
+
+    def p_fname(p):
+        'fname : ID'
+        p[0] = p[1]
+        p.set_lineno(0, p.lineno(1))
+
+PLY doesn't retain line number information from rules that have already
+been parsed.  If you are building an abstract syntax tree and need to
+have line numbers, you should make sure that the line numbers appear in
+the tree itself.
+
+AST Construction
+^^^^^^^^^^^^^^^^
+
+SLY provides no special functions for constructing an abstract syntax
+tree.  However, such construction is easy enough to do on your own.
+A minimal way to construct a tree is to simply create and
+propagate a tuple or list in each grammar rule function.  There
+are many possible ways to do this, but one example would be something
+like this::
+
+    @_('expression PLUS expression',
+       'expression MINUS expression',
+       'expression TIMES expression',
+       'expression DIVIDE expression')
+    def expression(self, p):
+        p[0] = ('binary-expression', p[2], p[1], p[3])
+
+    @_('LPAREN expression RPAREN')
+    def expression(self, p):
+        p[0] = ('group-expression', p[2])
+
+    @_('NUMBER')
+    def expression(self, p):
+        p[0] = ('number-expression', p[1])
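+
+A tree built from tuples like these can be processed with a small
+recursive function.  For example, here is a sketch of an evaluator for
+the tuples produced by the rules above::
+
+    def evaluate(tree):
+        # Each tuple is tagged with its kind in field 0 (see rules above)
+        kind = tree[0]
+        if kind == 'number-expression':
+            return tree[1]
+        if kind == 'group-expression':
+            return evaluate(tree[1])
+        # ('binary-expression', op, left, right)
+        op, left, right = tree[1], tree[2], tree[3]
+        lval, rval = evaluate(left), evaluate(right)
+        if op == '+':
+            return lval + rval
+        if op == '-':
+            return lval - rval
+        if op == '*':
+            return lval * rval
+        return lval / rval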
+
+Another approach is to create a set of data structures for different
+kinds of abstract syntax tree nodes and assign nodes to ``p[0]``
+in each rule.  For example::
+
+    class Expr:
+        pass
+
+    class BinOp(Expr):
+        def __init__(self, left, op, right):
+            self.type = "binop"
+            self.left = left
+            self.right = right
+            self.op = op
+
+    class Number(Expr):
+        def __init__(self, value):
+            self.type = "number"
+            self.value = value
+
+    @_('expression PLUS expression',
+       'expression MINUS expression',
+       'expression TIMES expression',
+       'expression DIVIDE expression')
+    def expression(self, p):
+        p[0] = BinOp(p[1], p[2], p[3])
+
+    @_('LPAREN expression RPAREN')
+    def expression(self, p):
+        p[0] = p[2]
+
+    @_('NUMBER')
+    def expression(self, p):
+        p[0] = Number(p[1])
+
+The advantage to this approach is that it may make it easier to attach
+more complicated semantics, type checking, code generation, and other
+features to the node classes.
+
+To simplify tree traversal, it may make sense to pick a very generic
+tree structure for your parse tree nodes.  For example::
+
+    class Node:
+        def __init__(self, type, children=None, leaf=None):
+            self.type = type
+            if children:
+                self.children = children
+            else:
+                self.children = []
+            self.leaf = leaf
+
+    @_('expression PLUS expression',
+       'expression MINUS expression',
+       'expression TIMES expression',
+       'expression DIVIDE expression')
+    def expression(self, p):
+        p[0] = Node("binop", [p[1], p[3]], p[2])
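+
+Once a generic ``Node`` class like this is in place, tree walks are
+easy to write.  For example, here is a sketch of a function that prints
+the tree structure; non-``Node`` children (such as raw token values)
+are assumed to appear directly in ``children`` and print as leaves::
+
+    def dump(node, indent=0):
+        # Non-Node children (e.g. raw token values) print as leaves
+        if not isinstance(node, Node):
+            print(' ' * indent + repr(node))
+            return
+        label = node.type if node.leaf is None else f'{node.type} ({node.leaf})'
+        print(' ' * indent + label)
+        for child in node.children:
+            dump(child, indent + 4)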
+
+Embedded Actions
+^^^^^^^^^^^^^^^^
+
+The parsing technique used by SLY only allows actions to be executed at
+the end of a rule.  For example, suppose you have a rule like this::
+
+    @_('A B C D')
+    def foo(self, p):
+        print("Parsed a foo", p[1], p[2], p[3], p[4])
+
+In this case, the supplied action code only executes after all of the
+symbols ``A``, ``B``, ``C``, and ``D`` have been
+parsed.  Sometimes, however, it is useful to execute small code
+fragments during intermediate stages of parsing.  For example, suppose
+you wanted to perform some action immediately after ``A`` has
+been parsed.  To do this, write an empty rule like this::
+
+    @_('A seen_A B C D')
+    def foo(self, p):
+        print("Parsed a foo", p[1], p[3], p[4], p[5])
+        print("seen_A returned", p[2])
+
+    @_('')
+    def seen_A(self, p):
+        print("Saw an A = ", p[-1])    # Access grammar symbol to the left
+        p[0] = some_value              # Assign value to seen_A
+
+In this example, the empty ``seen_A`` rule executes immediately
+after ``A`` is shifted onto the parsing stack.  Within this
+rule, ``p[-1]`` refers to the symbol on the stack that appears
+immediately to the left of the ``seen_A`` symbol.  In this case,
+it would be the value of ``A`` in the ``foo`` rule
+immediately above.  Like other rules, a value can be returned from an
+embedded action by simply assigning it to ``p[0]``.
+
+The use of embedded actions can sometimes introduce extra shift/reduce
+conflicts.  For example, this grammar has no conflicts::
+
+    @_('abcd',
+       'abcx')
+    def foo(self, p):
+        pass
+
+    @_('A B C D')
+    def abcd(self, p):
+        pass
+
+    @_('A B C X')
+    def abcx(self, p):
+        pass
+
+However, if you insert an embedded action into one of the rules like
+this::
+
+    @_('abcd',
+       'abcx')
+    def foo(self, p):
+        pass
+
+    @_('A B C D')
+    def abcd(self, p):
+        pass
+
+    @_('A B seen_AB C X')
+    def abcx(self, p):
+        pass
+
+    @_('')
+    def seen_AB(self, p):
+        pass
+
+an extra shift-reduce conflict will be introduced.  This conflict is
+caused by the fact that the same symbol ``C`` appears next in
+both the ``abcd`` and ``abcx`` rules.  The parser can either
+shift the symbol (``abcd`` rule) or reduce the empty
+rule ``seen_AB`` (``abcx`` rule).
+
+A common use of embedded rules is to control other aspects of parsing
+such as scoping of local variables.  For example, if you were parsing C
+code, you might write code like this::
+
+    @_('LBRACE new_scope statements RBRACE')
+    def statements_block(self, p):
+        # Action code
+        ...
+        pop_scope()        # Return to previous scope
+
+    @_('')
+    def new_scope(self, p):
+        # Create a new scope for local variables
+        s = new_scope()
+        push_scope(s)
+        ...
+
+In this case, the embedded action ``new_scope`` executes
+immediately after a ``LBRACE`` (``{``) symbol is parsed.
+This might adjust internal symbol tables and other aspects of the
+parser.  Upon completion of the rule ``statements_block``, code
+might undo the operations performed in the embedded action
+(e.g., ``pop_scope()``).
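+
+The scope-management helpers are not part of SLY; they would be
+ordinary module-level functions.  A hypothetical minimal version backed
+by a stack of dictionaries might look like this (note that the
+``new_scope()`` call inside the rule method resolves to this
+module-level helper, not to the grammar rule of the same name)::
+
+    # Hypothetical scope helpers used by the embedded action above
+    _scopes = [{}]                 # Stack of symbol tables; [0] is globals
+
+    def new_scope():
+        return {}
+
+    def push_scope(scope):
+        _scopes.append(scope)
+
+    def pop_scope():
+        return _scopes.pop()
+
+    def declare(name, value):
+        _scopes[-1][name] = value  # Declare in the innermost scope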
+
+Miscellaneous Yacc Notes
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+- By default, ``yacc.py`` relies on ``lex.py`` for tokenizing.
+  However, an alternative tokenizer can be supplied as follows::
+
+      parser = yacc.parse(lexer=x)
+
+  In this case, ``x`` must be a Lexer object that minimally has a
+  ``x.token()`` method for retrieving the next token.  If an input
+  string is given to ``yacc.parse()``, the lexer must also have an
+  ``x.input()`` method.
+
+- By default, the parser is generated in debugging mode (which produces
+  the ``parser.out`` file and other output).  To disable this, use::
+
+      parser = yacc.yacc(debug=False)
+
+- To change the name of the parser table file (``parsetab.py`` by
+  default), use::
+
+      parser = yacc.yacc(tabmodule="foo")
+
+  Normally, the ``parsetab.py`` file is placed into the same directory
+  as the module where the parser is defined.  If you want it to go
+  somewhere else, you can give an absolute package name for
+  ``tabmodule`` instead.  In that case, the tables will be written
+  there.
+
+- To change the directory in which the parser table file (and other
+  output files) are written, use::
+
+      parser = yacc.yacc(tabmodule="foo", outputdir="somedirectory")
+
+  Note: Be aware that unless the directory specified is also on
+  Python's path (``sys.path``), subsequent imports of the table file
+  will fail.  As a general rule, it's better to specify a destination
+  using the ``tabmodule`` argument instead of directly specifying a
+  directory using the ``outputdir`` argument.
+
+- To prevent the parser from generating any kind of table file, use::
+
+      parser = yacc.yacc(write_tables=False)
+
+  Note: If you disable table generation, the parsing tables will be
+  regenerated each time the parser runs (which may take a while
+  depending on how large your grammar is).
+
+- To print copious amounts of debugging during parsing, use::
+
+      parser = yacc.parse(debug=True)
+
+- It should be noted that table generation is reasonably efficient,
+  even for grammars that involve around 100 rules and several hundred
+  states.
+
+- If you put decorators on your grammar rule functions, the decorator
+  must preserve the information the parser needs: the rule's docstring
+  and its starting line number.  For example, the following
+  (hypothetical) ``strict`` decorator type-checks the value assigned to
+  ``p[0]``; it keeps the docstring via ``functools.wraps`` and copies
+  ``co_firstlineno`` onto the wrapper by hand::
+
+      from functools import wraps
+      from nodes import Collection
+
+
+      def strict(*types):
+          def decorate(func):
+              @wraps(func)
+              def wrapper(p):
+                  func(p)
+                  if not isinstance(p[0], types):
+                      raise TypeError
+
+              wrapper.co_firstlineno = func.__code__.co_firstlineno
+              return wrapper
+
+          return decorate
+
+      @strict(Collection)
+      def p_collection(p):
+          """
+          collection : sequence
+                     | map
+          """
+          p[0] = p[1]
+
+Multiple Parsers and Lexers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In advanced parsing applications, you may want to have multiple parsers
+and lexers.  As a general rule, this isn't a problem.  However, to make
+it work, you need to carefully make sure everything gets hooked up
+correctly.  First, make sure you save the objects returned by ``lex()``
+and ``yacc()``.  For example::
+
+    lexer = lex.lex()       # Return lexer object
+    parser = yacc.yacc()    # Return parser object
+
+Next, when parsing, make sure you give the ``parse()`` function a
+reference to the lexer it should be using.  For example::
+
+    parser.parse(text, lexer=lexer)
+
+If you forget to do this, the parser will use the last lexer
+created--which is not always what you want.
+
+Within lexer and parser rule functions, these objects are also
+available.  In the lexer, the "lexer" attribute of a token refers to
+the lexer object that triggered the rule.  For example::
+
+    def t_NUMBER(t):
+        r'\d+'
+        ...
+        print(t.lexer)          # Show lexer object
+
+In the parser, the "lexer" and "parser" attributes refer to the lexer
+and parser objects respectively::
+
+    def p_expr_plus(p):
+        'expr : expr PLUS expr'
+        ...
+        print(p.parser)         # Show parser object
+        print(p.lexer)          # Show lexer object
+
+If necessary, arbitrary attributes can be attached to the lexer or
+parser object.  For example, if you wanted to have different parsing
+modes, you could attach a mode attribute to the parser object and look
+at it later.
+
+Using Python's Optimized Mode
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Because PLY uses information from doc-strings, parsing and lexing
+information must be gathered while running the Python interpreter in
+normal mode (i.e., not with the ``-O`` or ``-OO`` options).  However,
+if you specify optimized mode like this::
+
+    lex.lex(optimize=1)
+    yacc.yacc(optimize=1)
+
+then PLY can later be used when Python runs in optimized mode.  To make
+this work, make sure you first run Python in normal mode.  Once the
+lexing and parsing tables have been generated the first time, run
+Python in optimized mode.  PLY will use the tables without the need for
+doc strings.
+
+Beware: running PLY in optimized mode disables a lot of error
+checking.  You should only do this when your project has stabilized
+and you don't need to do any debugging.  One of the purposes of
+optimized mode is to substantially decrease the startup time of
+your compiler (by assuming that everything is already properly
+specified and works).
+
+Advanced Debugging
+^^^^^^^^^^^^^^^^^^
+
+Debugging a compiler is typically not an easy task.  PLY provides some
+advanced diagnostic capabilities through the use of Python's
+``logging`` module.  The next two sections describe this.
+
+Debugging the lex() and yacc() commands
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Both the ``lex()`` and ``yacc()`` commands have a debugging
+mode that can be enabled using the ``debug`` flag.  For example::
+
+    lex.lex(debug=True)
+    yacc.yacc(debug=True)
+
+Normally, the output produced by debugging is routed to either
+standard error or, in the case of ``yacc()``, to a file
+``parser.out``.  This output can be more carefully controlled
+by supplying a logging object.  Here is an example that adds
+information about where different debugging messages are coming from::
+
+    # Set up a logging object
+    import logging
+    logging.basicConfig(
+        level = logging.DEBUG,
+        filename = "parselog.txt",
+        filemode = "w",
+        format = "%(filename)10s:%(lineno)4d:%(message)s"
+    )
+    log = logging.getLogger()
+
+    lex.lex(debug=True, debuglog=log)
+    yacc.yacc(debug=True, debuglog=log)
+
+If you supply a custom logger, the amount of debugging
+information produced can be controlled by setting the logging level.
+Typically, debugging messages are either issued at the ``DEBUG``,
+``INFO``, or ``WARNING`` levels.
+
+PLY's error messages and warnings are also produced using the logging
+interface.  This can be controlled by passing a logging object
+using the ``errorlog`` parameter::
+
+    lex.lex(errorlog=log)
+    yacc.yacc(errorlog=log)
+
+If you want to completely silence warnings, you can either pass in a
+logging object with an appropriate filter level or use the
+``NullLogger`` object defined in either ``lex`` or ``yacc``.  For
+example::
+
+    yacc.yacc(errorlog=yacc.NullLogger())
+
+Run-time Debugging
+^^^^^^^^^^^^^^^^^^
+
+To enable run-time debugging of a parser, use the ``debug`` option to
+``parse()``.  This option can either be an integer (which simply turns
+debugging on or off) or an instance of a logger object.  For example::
+
+    log = logging.getLogger()
+    parser.parse(input, debug=log)
+
+If a logging object is passed, you can use its filtering level to
+control how much output gets generated.  The ``INFO`` level is used to
+produce information about rule reductions.  The ``DEBUG`` level will
+show information about the parsing stack, token shifts, and other
+details.  The ``ERROR`` level shows information related to parsing
+errors.
+
+For very complicated problems, you should pass in a logging object that
+redirects to a file where you can more easily inspect the output after
+execution.