diff --git a/docs/sly.rst b/docs/sly.rst index 32b43a1..8d56f41 100644 --- a/docs/sly.rst +++ b/docs/sly.rst @@ -375,29 +375,6 @@ parser is often a hard problem. An error handler might scan ahead to a logical synchronization point such as a semicolon, a blank line, or similar landmark. -EOF Handling -^^^^^^^^^^^^ - -The lexer will produce tokens until it reaches the end of the supplied -input string. An optional ``eof()`` method can be used to handle an -end-of-file (EOF) condition in the input. For example:: - - class MyLexer(Lexer): - ... - # EOF handling rule - def eof(self): - # Get more input (Example) - more = input('more > ') - return more - -The ``eof()`` method should return a string as a result. Be aware -that reading input in chunks may require great attention to the -handling of chunk boundaries. Specifically, you can't break the text -such that a chunk boundary appears in the middle of a token (for -example, splitting input in the middle of a quoted string). For -this reason, you might have to do some additional framing -of the data such as splitting into lines or blocks to make it work. - Maintaining extra state ^^^^^^^^^^^^^^^^^^^^^^^ @@ -421,3 +398,1780 @@ example:: Please note that lexers already use the ``lineno`` and ``position`` attributes during parsing. +Writing a Parser +---------------- + +The ``Parser`` class is used to parse language syntax. Before showing +an example, there are a few important bits of background that must be +mentioned. First, *syntax* is usually specified in terms of a BNF +grammar. For example, if you wanted to parse simple arithmetic +expressions, you might first write an unambiguous grammar +specification like this:: + + expr : expr + term + | expr - term + | term + + term : term * factor + | term / factor + | factor + + factor : NUMBER + | ( expr ) + +In the grammar, symbols such as ``NUMBER``, ``+``, ``-``, ``*``, and +``/`` are known as *terminals* and correspond to raw input tokens. +Identifiers such as ``term`` and ``factor`` refer to grammar rules +comprised of a collection of terminals and other rules. These +identifiers are known as *non-terminals*. + +The semantic behavior of a language is often specified using a +technique known as syntax directed translation. In syntax directed +translation, attributes are attached to each symbol in a given grammar +rule along with an action. Whenever a particular grammar rule is +recognized, the action describes what to do. For example, given the +expression grammar above, you might write the specification for a +simple calculator like this:: + + Grammar Action + ------------------------ -------------------------------- + expr0 : expr1 + term expr0.val = expr1.val + term.val + | expr1 - term expr0.val = expr1.val - term.val + | term expr0.val = term.val + + term0 : term1 * factor term0.val = term1.val * factor.val + | term1 / factor term0.val = term1.val / factor.val + | factor term0.val = factor.val + + factor : NUMBER factor.val = int(NUMBER.val) + | ( expr ) factor.val = expr.val + +A good way to think about syntax directed translation is to view each +symbol in the grammar as a kind of object. Associated with each symbol +is a value representing its "state" (for example, the ``val`` +attribute above). Semantic actions are then expressed as a collection +of functions or methods that operate on the symbols and associated +values. + +SLY uses a parsing technique known as LR-parsing or shift-reduce +parsing. 
LR parsing is a bottom up technique that tries to recognize +the right-hand-side of various grammar rules. Whenever a valid +right-hand-side is found in the input, the appropriate action code is +triggered and the grammar symbols are replaced by the grammar symbol +on the left-hand-side. + +LR parsing is commonly implemented by shifting grammar symbols onto a +stack and looking at the stack and the next input token for patterns +that match one of the grammar rules. The details of the algorithm can +be found in a compiler textbook, but the following example illustrates +the steps that are performed if you wanted to parse the expression ``3 ++ 5 * (10 - 20)`` using the grammar defined above. In the example, +the special symbol ``$`` represents the end of input:: + + Step Symbol Stack Input Tokens Action + ---- --------------------- --------------------- ------------------------------- + 1 3 + 5 * ( 10 - 20 )$ Shift 3 + 2 3 + 5 * ( 10 - 20 )$ Reduce factor : NUMBER + 3 factor + 5 * ( 10 - 20 )$ Reduce term : factor + 4 term + 5 * ( 10 - 20 )$ Reduce expr : term + 5 expr + 5 * ( 10 - 20 )$ Shift + + 6 expr + 5 * ( 10 - 20 )$ Shift 5 + 7 expr + 5 * ( 10 - 20 )$ Reduce factor : NUMBER + 8 expr + factor * ( 10 - 20 )$ Reduce term : factor + 9 expr + term * ( 10 - 20 )$ Shift * + 10 expr + term * ( 10 - 20 )$ Shift ( + 11 expr + term * ( 10 - 20 )$ Shift 10 + 12 expr + term * ( 10 - 20 )$ Reduce factor : NUMBER + 13 expr + term * ( factor - 20 )$ Reduce term : factor + 14 expr + term * ( term - 20 )$ Reduce expr : term + 15 expr + term * ( expr - 20 )$ Shift - + 16 expr + term * ( expr - 20 )$ Shift 20 + 17 expr + term * ( expr - 20 )$ Reduce factor : NUMBER + 18 expr + term * ( expr - factor )$ Reduce term : factor + 19 expr + term * ( expr - term )$ Reduce expr : expr - term + 20 expr + term * ( expr )$ Shift ) + 21 expr + term * ( expr ) $ Reduce factor : (expr) + 22 expr + term * factor $ Reduce term : term * factor + 23 expr + term $ Reduce expr : expr + term + 24 expr $ Reduce expr + 25 $ Success! + +When parsing the expression, an underlying state machine and the +current input token determine what happens next. If the next token +looks like part of a valid grammar rule (based on other items on the +stack), it is generally shifted onto the stack. If the top of the +stack contains a valid right-hand-side of a grammar rule, it is +usually "reduced" and the symbols replaced with the symbol on the +left-hand-side. When this reduction occurs, the appropriate action is +triggered (if defined). If the input token can't be shifted and the +top of stack doesn't match any grammar rules, a syntax error has +occurred and the parser must take some kind of recovery step (or bail +out). A parse is only successful if the parser reaches a state where +the symbol stack is empty and there are no more input tokens. + +It is important to note that the underlying implementation is built +around a large finite-state machine that is encoded in a collection of +tables. The construction of these tables is non-trivial and +beyond the scope of this discussion. However, subtle details of this +process explain why, in the example above, the parser chooses to shift +a token onto the stack in step 9 rather than reducing the +rule ``expr : expr + term``. + +Parsing Example +^^^^^^^^^^^^^^^ +Suppose you wanted to make a grammar for simple arithmetic expressions as previously described. 
+Here is how you would do it with SLY::
+
+    from sly import Parser
+    from calclex import CalcLexer
+
+    class CalcParser(Parser):
+        # Get the token list from the lexer (required)
+        tokens = CalcLexer.tokens
+
+        # Grammar rules and actions
+        @_('expr PLUS term')
+        def expr(self, p):
+            p[0] = p[1] + p[3]
+
+        @_('expr MINUS term')
+        def expr(self, p):
+            p[0] = p[1] - p[3]
+
+        @_('term')
+        def expr(self, p):
+            p[0] = p[1]
+
+        @_('term TIMES factor')
+        def term(self, p):
+            p[0] = p[1] * p[3]
+
+        @_('term DIVIDE factor')
+        def term(self, p):
+            p[0] = p[1] / p[3]
+
+        @_('factor')
+        def term(self, p):
+            p[0] = p[1]
+
+        @_('NUMBER')
+        def factor(self, p):
+            p[0] = p[1]
+
+        @_('LPAREN expr RPAREN')
+        def factor(self, p):
+            p[0] = p[2]
+
+        # Error rule for syntax errors
+        def error(self, p):
+            print("Syntax error in input!")
+
+    if __name__ == '__main__':
+        lexer = CalcLexer()
+        parser = CalcParser()
+
+        while True:
+            try:
+                text = input('calc > ')
+                result = parser.parse(lexer.tokenize(text))
+                print(result)
+            except EOFError:
+                break
+
+In this example, each grammar rule is defined by a method that has been
+decorated with ``@_(rule)``, where ``rule`` is a string containing the
+appropriate context-free grammar specification.  The statements that
+make up the method body implement the semantic actions of the rule.
+Each method accepts a single argument ``p`` that is a sequence
+containing the values of each grammar symbol in the corresponding rule.
+The values of ``p[i]`` are mapped to grammar symbols as shown here::
+
+    #   p[1] p[2] p[3]
+    #    |    |    |
+    @_('expr PLUS term')
+    def expr(self, p):
+        p[0] = p[1] + p[3]
+
+For tokens, the "value" of the corresponding ``p[i]`` is the
+*same* as the ``p.value`` attribute assigned in the lexer
+module.  For non-terminals, the value is determined by whatever is
+placed in ``p[0]`` when rules are reduced.  This value can be
+anything at all.  However, it is probably most common for the value to
+be a simple Python type, a tuple, or an instance.  In this example, we
+are relying on the fact that the ``NUMBER`` token stores an
+integer value in its value field.  All of the other rules simply
+perform various types of integer operations and propagate the result.
+
+Note: The use of negative indices has a special meaning.  In
+particular, ``p[-1]`` does not have the same value as ``p[3]`` in this
+example.  Please see the section on "Embedded Actions" for further
+details.
+
+The first rule defined in the parser class determines the starting
+grammar symbol (in this case, a rule for ``expr`` appears first).
+Whenever the starting rule is reduced by the parser and no more input
+is available, parsing stops and the final value is returned (this value
+will be whatever the top-most rule placed in ``p[0]``).  Note: an
+alternative starting symbol can be specified using the ``start``
+attribute in the class.
+
+The ``error()`` method is defined to catch syntax errors.
+See the error handling section below for more detail.
+
+If any errors are detected in your grammar specification, SLY will
+produce diagnostic messages and possibly raise an exception.  Some of
+the errors that can be detected include:
+
+- Duplicated grammar rules
+- Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars
+- Badly specified grammar rules
+- Infinite recursion (rules that can never terminate)
+- Unused rules and tokens
+- Undefined rules and tokens
+
+The final part of the example shows how to actually run the parser.
+To run the parser, you simply have to call the ``parse()`` method with +a sequence of the input tokens. This will run all of the grammar +rules and return the result of the entire parse. This result return +is the value assigned to ``p[0]`` in the starting grammar rule. + +Combining Grammar Rule Functions +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +When grammar rules are similar, they can be combined into a single function. +For example, consider the two rules in our earlier example:: + + @_('expr PLUS term') + def expr(self, p): + p[0] = p[1] + p[3] + + @_('expr MINUS term') + def expr(self, p): + p[0] = p[1] - p[3] + +Instead of writing two functions, you might write a single function like this: + + @_('expr PLUS term', + 'expr MINUS term') + def expr(self, p): + if p[2] == '+': + p[0] = p[1] + p[3] + elif p[2] == '-': + p[0] = p[1] - p[3] + +In general, the ``@_()`` decorator for any given method can list +multiple grammar rules. When combining grammar rules into a single +function though, it is usually a good idea for all of the rules to have a +similar structure (e.g., the same number of terms). Otherwise, the +corresponding action code may be more complicated than necessary. +However, it is possible to handle simple cases using len(). For +example: + + @_('expr MINUS expr', + 'MINUS expression') + def expr(self, p): + if (len(p) == 4): + p[0] = p[1] - p[3] + elif (len(p) == 3): + p[0] = -p[2] + +If parsing performance is a concern, you should resist the urge to put +too much conditional processing into a single grammar rule as shown in +these examples. When you add checks to see which grammar rule is +being handled, you are actually duplicating the work that the parser +has already performed (i.e., the parser already knows exactly what rule it +matched). You can eliminate this overhead by using a +separate method for each grammar rule. + +Character Literals +^^^^^^^^^^^^^^^^^^ + +If desired, a grammar may contain tokens defined as single character +literals. For example:: + + @_('expr "+" term') + def expr(self, p): + p[0] = p[1] + p[3] + + @_('expr "-" term') + def expr(self, p): + p[0] = p[1] - p[3] + +A character literal must be enclosed in quotes such as ``"+"``. In addition, if literals are used, they must be declared in the +corresponding lexer class through the use of a special ``literals`` declaration:: + + class CalcLexer(Lexer): + ... + literals = ['+','-','*','/' ] + ... + +Character literals are limited to a single character. Thus, it is not +legal to specify literals such as ``<=`` or ``==``. +For this, use the normal lexing rules (e.g., define a rule such as +``EQ = r'=='``). + +Empty Productions +^^^^^^^^^^^^^^^^^ + +If you need an empty production, define a special rule like this:: + + @_('') + def empty(self, p): + pass + +Now to use the empty production, simply use 'empty' as a symbol. For +example:: + + @_('item') + def optitem(self, p): + ... + + @_('empty') + def optitem(self, p): + ... + +Note: You can write empty rules anywhere by simply specifying an empty +string. However, I personally find that writing an "empty" +rule and using "empty" to denote an empty production is easier to read +and more clearly states your intentions. + +Changing the starting symbol +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Normally, the first rule found in a yacc specification defines the +starting grammar rule (top level rule). To change this, supply +a ``start`` specifier in your file. For example:: + + class CalcParser(Parser): + start = 'foo' + + @_('A B') + def bar(self, p): + ... 
+ + @_('bar X') + def foo(self, p): # Parsing starts here (start symbol above) + ... + +The use of a ``start`` specifier may be useful during debugging +since you can use it to work with a subset of a larger grammar. + +Dealing With Ambiguous Grammars +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The expression grammar given in the earlier example has been written +in a special format to eliminate ambiguity. However, in many +situations, it is extremely difficult or awkward to write grammars in +this format. A much more natural way to express the grammar is in a +more compact form like this:: + +expr : expr PLUS expr + | expr MINUS expr + | expr TIMES expr + | expr DIVIDE expr + | LPAREN expr RPAREN + | NUMBER + +Unfortunately, this grammar specification is ambiguous. For example, +if you are parsing the string "3 * 4 + 5", there is no way to tell how +the operators are supposed to be grouped. For example, does the +expression mean "(3 * 4) + 5" or is it "3 * (4+5)"? + +When an ambiguous grammar is given, you will get messages about +"shift/reduce conflicts" or "reduce/reduce conflicts". A shift/reduce +conflict is caused when the parser generator can't decide whether or +not to reduce a rule or shift a symbol on the parsing stack. For +example, consider the string "3 * 4 + 5" and the internal parsing +stack:: + + Step Symbol Stack Input Tokens Action + ---- --------------------- --------------------- ------------------------------- + 1 $ 3 * 4 + 5$ Shift 3 + 2 $ 3 * 4 + 5$ Reduce : expr : NUMBER + 3 $ expr * 4 + 5$ Shift * + 4 $ expr * 4 + 5$ Shift 4 + 5 $ expr * 4 + 5$ Reduce: expr : NUMBER + 6 $ expr * expr + 5$ SHIFT/REDUCE CONFLICT ???? + +In this case, when the parser reaches step 6, it has two options. One +is to reduce the rule ``expr : expr * expr`` on the stack. The other +option is to shift the token ``+`` on the stack. Both options are +perfectly legal from the rules of the context-free-grammar. + +By default, all shift/reduce conflicts are resolved in favor of +shifting. Therefore, in the above example, the parser will always +shift the ``+`` instead of reducing. Although this strategy +works in many cases (for example, the case of +"if-then" versus "if-then-else"), it is not enough for arithmetic expressions. In fact, +in the above example, the decision to shift ``+`` is completely +wrong---we should have reduced ``expr * expr`` since +multiplication has higher mathematical precedence than addition. + +To resolve ambiguity, especially in expression +grammars, SLY allows individual tokens to be assigned a +precedence level and associativity. This is done by adding a variable +``precedence`` to the grammar file like this:: + + class CalcParser(Parser): + ... + precedence = ( + ('left', 'PLUS', 'MINUS'), + ('left', 'TIMES', 'DIVIDE'), + ) + ... + +This declaration specifies that ``PLUS``/``MINUS`` have the +same precedence level and are left-associative and that +``TIMES``/``DIVIDE`` have the same precedence and are +left-associative. Within the ``precedence`` declaration, tokens +are ordered from lowest to highest precedence. Thus, this declaration +specifies that ``TIMES``/``DIVIDE`` have higher precedence +than ``PLUS``/``MINUS`` (since they appear later in the +precedence specification). + +The precedence specification works by associating a numerical +precedence level value and associativity direction to the listed +tokens. 
For example, in the above example you get:: + + PLUS : level = 1, assoc = 'left' + MINUS : level = 1, assoc = 'left' + TIMES : level = 2, assoc = 'left' + DIVIDE : level = 2, assoc = 'left' + +These values are then used to attach a numerical precedence value and +associativity direction to each grammar rule. *This is always +determined by looking at the precedence of the right-most terminal +symbol.* For example:: + + expr : expr PLUS expr # level = 1, left + | expr MINUS expr # level = 1, left + | expr TIMES expr # level = 2, left + | expr DIVIDE expr # level = 2, left + | LPAREN expr RPAREN # level = None (not specified) + | NUMBER # level = None (not specified) + +When shift/reduce conflicts are encountered, the parser generator +resolves the conflict by looking at the precedence rules and +associativity specifiers. + +1. If the current token has higher precedence than the rule on the stack, it is shifted. + +2. If the grammar rule on the stack has higher precedence, the rule is reduced. + +3. If the current token and the grammar rule have the same precedence, the +rule is reduced for left associativity, whereas the token is shifted for right associativity. + +4. If nothing is known about the precedence, shift/reduce conflicts are resolved in +favor of shifting (the default). + +For example, if "expr PLUS expr" has been parsed and the +next token is "TIMES", the action is going to be a shift because +"TIMES" has a higher precedence level than "PLUS". On the other hand, +if "expr TIMES expr" has been parsed and the next token is +"PLUS", the action is going to be reduce because "PLUS" has a lower +precedence than "TIMES." + +When shift/reduce conflicts are resolved using the first three +techniques (with the help of precedence rules), SLY will +report no errors or conflicts in the grammar. + +One problem with the precedence specifier technique is that it is +sometimes necessary to change the precedence of an operator in certain +contexts. For example, consider a unary-minus operator in "3 + 4 * +-5". Mathematically, the unary minus is normally given a very high +precedence--being evaluated before the multiply. However, in our +precedence specifier, MINUS has a lower precedence than TIMES. To +deal with this, precedence rules can be given for so-called "fictitious tokens" +like this:: + + class CalcParser(Parser): + ... + precedence = ( + ('left', 'PLUS', 'MINUS'), + ('left', 'TIMES', 'DIVIDE'), + ('right', 'UMINUS'), # Unary minus operator + ) + +Now, in the grammar file, we can write our unary minus rule like this:: + + @_('MINUS expr %prec UMINUS') + def expr(p): + p[0] = -p[2] + +In this case, ``%prec UMINUS`` overrides the default rule precedence--setting it to that +of UMINUS in the precedence specifier. + +At first, the use of UMINUS in this example may appear very confusing. +UMINUS is not an input token or a grammar rule. Instead, you should +think of it as the name of a special marker in the precedence table. +When you use the ``%prec`` qualifier, you're simply telling SLY +that you want the precedence of the expression to be the same as for +this special marker instead of the usual precedence. + +It is also possible to specify non-associativity in the ``precedence`` table. This would +be used when you *don't* want operations to chain together. For example, suppose +you wanted to support comparison operators like ``<`` and ``>`` but you didn't want to allow +combinations like ``a < b < c``. To do this, specify a rule like this:: + + class MyParser(Parser): + ... 
+ precedence = ( + ('nonassoc', 'LESSTHAN', 'GREATERTHAN'), # Nonassociative operators + ('left', 'PLUS', 'MINUS'), + ('left', 'TIMES', 'DIVIDE'), + ('right', 'UMINUS'), # Unary minus operator + ) + +If you do this, the occurrence of input text such as ``a < b < c`` +will result in a syntax error. However, simple expressions such as +``a < b`` will still be fine. + +Reduce/reduce conflicts are caused when there are multiple grammar +rules that can be applied to a given set of symbols. This kind of +conflict is almost always bad and is always resolved by picking the +rule that appears first in the grammar file. Reduce/reduce conflicts +are almost always caused when different sets of grammar rules somehow +generate the same set of symbols. For example:: + + + assignment : ID EQUALS NUMBER + | ID EQUALS expr + + expr : expr PLUS expr + | expr MINUS expr + | expr TIMES expr + | expr DIVIDE expr + | LPAREN expr RPAREN + | NUMBER + +In this case, a reduce/reduce conflict exists between these two rules:: + + assignment : ID EQUALS NUMBER + expr : NUMBER + +For example, if you wrote "a = 5", the parser can't figure out if this +is supposed to be reduced as ``assignment : ID EQUALS NUMBER`` or +whether it's supposed to reduce the 5 as an expression and then reduce +the rule ``assignment : ID EQUALS expr``. + +It should be noted that reduce/reduce conflicts are notoriously +difficult to spot simply looking at the input grammar. When a +reduce/reduce conflict occurs, SLY will try to help by +printing a warning message such as this:: + + WARNING: 1 reduce/reduce conflict + WARNING: reduce/reduce conflict in state 15 resolved using rule (assignment -> ID EQUALS NUMBER) + WARNING: rejected rule (expression -> NUMBER) + +This message identifies the two rules that are in conflict. However, +it may not tell you how the parser arrived at such a state. To try +and figure it out, you'll probably have to look at your grammar and +the contents of the parser debugging file with an appropriately high +level of caffeination (see the next section). + +Parser Debugging +^^^^^^^^^^^^^^^^ + +Tracking down shift/reduce and reduce/reduce conflicts is one of the +finer pleasures of using an LR parsing algorithm. To assist in +debugging, SLY creates a debugging file called 'parser.out' when it +generates the parsing table. The contents of this file look like the +following: + +
+
+Unused terminals:
+
+
+Grammar
+
+Rule 1     expression -> expression PLUS expression
+Rule 2     expression -> expression MINUS expression
+Rule 3     expression -> expression TIMES expression
+Rule 4     expression -> expression DIVIDE expression
+Rule 5     expression -> NUMBER
+Rule 6     expression -> LPAREN expression RPAREN
+
+Terminals, with rules where they appear
+
+TIMES                : 3
+error                : 
+MINUS                : 2
+RPAREN               : 6
+LPAREN               : 6
+DIVIDE               : 4
+PLUS                 : 1
+NUMBER               : 5
+
+Nonterminals, with rules where they appear
+
+expression           : 1 1 2 2 3 3 4 4 6 0
+
+
+Parsing method: LALR
+
+
+state 0
+
+    S' -> . expression
+    expression -> . expression PLUS expression
+    expression -> . expression MINUS expression
+    expression -> . expression TIMES expression
+    expression -> . expression DIVIDE expression
+    expression -> . NUMBER
+    expression -> . LPAREN expression RPAREN
+
+    NUMBER          shift and go to state 3
+    LPAREN          shift and go to state 2
+
+
+state 1
+
+    S' -> expression .
+    expression -> expression . PLUS expression
+    expression -> expression . MINUS expression
+    expression -> expression . TIMES expression
+    expression -> expression . DIVIDE expression
+
+    PLUS            shift and go to state 6
+    MINUS           shift and go to state 5
+    TIMES           shift and go to state 4
+    DIVIDE          shift and go to state 7
+
+
+state 2
+
+    expression -> LPAREN . expression RPAREN
+    expression -> . expression PLUS expression
+    expression -> . expression MINUS expression
+    expression -> . expression TIMES expression
+    expression -> . expression DIVIDE expression
+    expression -> . NUMBER
+    expression -> . LPAREN expression RPAREN
+
+    NUMBER          shift and go to state 3
+    LPAREN          shift and go to state 2
+
+
+state 3
+
+    expression -> NUMBER .
+
+    $               reduce using rule 5
+    PLUS            reduce using rule 5
+    MINUS           reduce using rule 5
+    TIMES           reduce using rule 5
+    DIVIDE          reduce using rule 5
+    RPAREN          reduce using rule 5
+
+
+state 4
+
+    expression -> expression TIMES . expression
+    expression -> . expression PLUS expression
+    expression -> . expression MINUS expression
+    expression -> . expression TIMES expression
+    expression -> . expression DIVIDE expression
+    expression -> . NUMBER
+    expression -> . LPAREN expression RPAREN
+
+    NUMBER          shift and go to state 3
+    LPAREN          shift and go to state 2
+
+
+state 5
+
+    expression -> expression MINUS . expression
+    expression -> . expression PLUS expression
+    expression -> . expression MINUS expression
+    expression -> . expression TIMES expression
+    expression -> . expression DIVIDE expression
+    expression -> . NUMBER
+    expression -> . LPAREN expression RPAREN
+
+    NUMBER          shift and go to state 3
+    LPAREN          shift and go to state 2
+
+
+state 6
+
+    expression -> expression PLUS . expression
+    expression -> . expression PLUS expression
+    expression -> . expression MINUS expression
+    expression -> . expression TIMES expression
+    expression -> . expression DIVIDE expression
+    expression -> . NUMBER
+    expression -> . LPAREN expression RPAREN
+
+    NUMBER          shift and go to state 3
+    LPAREN          shift and go to state 2
+
+
+state 7
+
+    expression -> expression DIVIDE . expression
+    expression -> . expression PLUS expression
+    expression -> . expression MINUS expression
+    expression -> . expression TIMES expression
+    expression -> . expression DIVIDE expression
+    expression -> . NUMBER
+    expression -> . LPAREN expression RPAREN
+
+    NUMBER          shift and go to state 3
+    LPAREN          shift and go to state 2
+
+
+state 8
+
+    expression -> LPAREN expression . RPAREN
+    expression -> expression . PLUS expression
+    expression -> expression . MINUS expression
+    expression -> expression . TIMES expression
+    expression -> expression . DIVIDE expression
+
+    RPAREN          shift and go to state 13
+    PLUS            shift and go to state 6
+    MINUS           shift and go to state 5
+    TIMES           shift and go to state 4
+    DIVIDE          shift and go to state 7
+
+
+state 9
+
+    expression -> expression TIMES expression .
+    expression -> expression . PLUS expression
+    expression -> expression . MINUS expression
+    expression -> expression . TIMES expression
+    expression -> expression . DIVIDE expression
+
+    $               reduce using rule 3
+    PLUS            reduce using rule 3
+    MINUS           reduce using rule 3
+    TIMES           reduce using rule 3
+    DIVIDE          reduce using rule 3
+    RPAREN          reduce using rule 3
+
+  ! PLUS            [ shift and go to state 6 ]
+  ! MINUS           [ shift and go to state 5 ]
+  ! TIMES           [ shift and go to state 4 ]
+  ! DIVIDE          [ shift and go to state 7 ]
+
+state 10
+
+    expression -> expression MINUS expression .
+    expression -> expression . PLUS expression
+    expression -> expression . MINUS expression
+    expression -> expression . TIMES expression
+    expression -> expression . DIVIDE expression
+
+    $               reduce using rule 2
+    PLUS            reduce using rule 2
+    MINUS           reduce using rule 2
+    RPAREN          reduce using rule 2
+    TIMES           shift and go to state 4
+    DIVIDE          shift and go to state 7
+
+  ! TIMES           [ reduce using rule 2 ]
+  ! DIVIDE          [ reduce using rule 2 ]
+  ! PLUS            [ shift and go to state 6 ]
+  ! MINUS           [ shift and go to state 5 ]
+
+state 11
+
+    expression -> expression PLUS expression .
+    expression -> expression . PLUS expression
+    expression -> expression . MINUS expression
+    expression -> expression . TIMES expression
+    expression -> expression . DIVIDE expression
+
+    $               reduce using rule 1
+    PLUS            reduce using rule 1
+    MINUS           reduce using rule 1
+    RPAREN          reduce using rule 1
+    TIMES           shift and go to state 4
+    DIVIDE          shift and go to state 7
+
+  ! TIMES           [ reduce using rule 1 ]
+  ! DIVIDE          [ reduce using rule 1 ]
+  ! PLUS            [ shift and go to state 6 ]
+  ! MINUS           [ shift and go to state 5 ]
+
+state 12
+
+    expression -> expression DIVIDE expression .
+    expression -> expression . PLUS expression
+    expression -> expression . MINUS expression
+    expression -> expression . TIMES expression
+    expression -> expression . DIVIDE expression
+
+    $               reduce using rule 4
+    PLUS            reduce using rule 4
+    MINUS           reduce using rule 4
+    TIMES           reduce using rule 4
+    DIVIDE          reduce using rule 4
+    RPAREN          reduce using rule 4
+
+  ! PLUS            [ shift and go to state 6 ]
+  ! MINUS           [ shift and go to state 5 ]
+  ! TIMES           [ shift and go to state 4 ]
+  ! DIVIDE          [ shift and go to state 7 ]
+
+state 13
+
+    expression -> LPAREN expression RPAREN .
+
+    $               reduce using rule 6
+    PLUS            reduce using rule 6
+    MINUS           reduce using rule 6
+    TIMES           reduce using rule 6
+    DIVIDE          reduce using rule 6
+    RPAREN          reduce using rule 6
+
+
+ +The different states that appear in this file are a representation of +every possible sequence of valid input tokens allowed by the grammar. +When receiving input tokens, the parser is building up a stack and +looking for matching rules. Each state keeps track of the grammar +rules that might be in the process of being matched at that point. Within each +rule, the "." character indicates the current location of the parse +within that rule. In addition, the actions for each valid input token +are listed. When a shift/reduce or reduce/reduce conflict arises, +rules not selected are prefixed with an !. For example: + +
+
+  ! TIMES           [ reduce using rule 2 ]
+  ! DIVIDE          [ reduce using rule 2 ]
+  ! PLUS            [ shift and go to state 6 ]
+  ! MINUS           [ shift and go to state 5 ]
+
+
+ +By looking at these rules (and with a little practice), you can usually track down the source +of most parsing conflicts. It should also be stressed that not all shift-reduce conflicts are +bad. However, the only way to be sure that they are resolved correctly is to look at parser.out. + +

+Syntax Error Handling
+^^^^^^^^^^^^^^^^^^^^^

+
+If you are creating a parser for production use, the handling of
+syntax errors is important.  As a general rule, you don't want a
+parser to simply throw up its hands and stop at the first sign of
+trouble.  Instead, you want it to report the error, recover if
+possible, and continue parsing so that all of the errors in the input
+get reported to the user at once.  This is the standard behavior found
+in compilers for languages such as C, C++, and Java.
+
+In PLY, when a syntax error occurs during parsing, the error is
+immediately detected (i.e., the parser does not read any more tokens
+beyond the source of the error).  However, at this point, the parser
+enters a recovery mode that can be used to try and continue further
+parsing.  As a general rule, error recovery in LR parsers is a delicate
+topic that involves ancient rituals and black-magic.  The recovery
+mechanism provided by yacc.py is comparable to Unix yacc so you may
+want to consult a book like O'Reilly's "Lex and Yacc" for some of the
+finer details.
+

+When a syntax error occurs, yacc.py performs the following steps:
+
+1. On the first occurrence of an error, the user-defined ``p_error()``
+   function is called with the offending token as an argument.  However,
+   if the syntax error is due to reaching the end-of-file, ``p_error()``
+   is called with an argument of ``None``.  Afterwards, the parser enters
+   an "error-recovery" mode in which it will not make future calls to
+   ``p_error()`` until it has successfully shifted at least 3 tokens onto
+   the parsing stack.
+
+2. If no recovery action is taken in ``p_error()``, the offending
+   lookahead token is replaced with a special ``error`` token.
+
+3. If the offending lookahead token is already set to ``error``, the top
+   item of the parsing stack is deleted.
+
+4. If the entire parsing stack is unwound, the parser enters a restart
+   state and attempts to start parsing from its initial state.
+
+5. If a grammar rule accepts ``error`` as a token, it will be shifted
+   onto the parsing stack.
+
+6. If the top item of the parsing stack is ``error``, lookahead tokens
+   will be discarded until the parser can successfully shift a new symbol
+   or reduce a rule involving ``error``.
+ +

+Recovery and resynchronization with error rules
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+
+The most well-behaved approach for handling syntax errors is to write
+grammar rules that include the ``error`` token.  For example, suppose
+your language had a grammar rule for a print statement like this::
+
+    def p_statement_print(p):
+        'statement : PRINT expr SEMI'
+        ...
+
+To account for the possibility of a bad expression, you might write an
+additional grammar rule like this::
+
+    def p_statement_print_error(p):
+        'statement : PRINT error SEMI'
+        print("Syntax error in print statement. Bad expression")
+
+ +In this case, the error token will match any sequence of +tokens that might appear up to the first semicolon that is +encountered. Once the semicolon is reached, the rule will be +invoked and the error token will go away. + +
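+In SLY, the same idea would be expressed with the ``@_()`` decorator used
+elsewhere in this document.  A minimal sketch (assuming that the ``error``
+token may appear in a rule string just as in yacc, and reusing
+hypothetical ``PRINT`` and ``SEMI`` tokens from a larger grammar)::
+
+    class MyParser(Parser):
+        ...
+        @_('PRINT expr SEMI')
+        def statement(self, p):
+            ...
+
+        @_('PRINT error SEMI')
+        def statement(self, p):
+            print("Syntax error in print statement. Bad expression")
+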

+This type of recovery is sometimes known as parser resynchronization. +The error token acts as a wildcard for any bad input text and +the token immediately following error acts as a +synchronization token. + +

+It is important to note that the ``error`` token usually does not appear
+as the last token on the right in an error rule.  For example::
+
+    def p_statement_print_error(p):
+        'statement : PRINT error'
+        print("Syntax error in print statement. Bad expression")
+
+ +This is because the first bad token encountered will cause the rule to +be reduced--which may make it difficult to recover if more bad tokens +immediately follow. + +

+Panic mode recovery
+^^^^^^^^^^^^^^^^^^^

+ + +An alternative error recovery scheme is to enter a panic mode recovery in which tokens are +discarded to a point where the parser might be able to recover in some sensible manner. + +

+Panic mode recovery is implemented entirely in the ``p_error()``
+function.  For example, this function starts discarding tokens until it
+reaches a closing ``'}'``.  Then, it restarts the parser in its initial
+state::
+
+    def p_error(p):
+        print("Whoa. You are seriously hosed.")
+        if not p:
+            print("End of File!")
+            return
+
+        # Read ahead looking for a closing '}'
+        while True:
+            tok = parser.token()             # Get the next token
+            if not tok or tok.type == 'RBRACE':
+                break
+        parser.restart()
+
+
+ +

+This function simply discards the bad token and tells the parser that
+the error was okay::
+
+    def p_error(p):
+        if p:
+            print("Syntax error at token", p.type)
+            # Just discard the token and tell the parser it's okay.
+            parser.errok()
+        else:
+            print("Syntax error at EOF")
+
+
+ +

+More information on these methods is as follows:
+
+- ``parser.errok()``.  This resets the parser so that it no longer
+  believes it is in error-recovery mode.  The next syntax error will
+  trigger ``p_error()`` again.
+
+- ``parser.token()``.  This returns the next token on the input stream.
+
+- ``parser.restart()``.  This discards the entire parsing stack and
+  resets the parser to its initial state.

+To supply the next lookahead token to the parser, ``p_error()`` can
+return a token.  This might be useful if trying to synchronize on
+special characters.  For example::
+
+    def p_error(p):
+        # Read ahead looking for a terminating ";"
+        while True:
+            tok = parser.token()             # Get the next token
+            if not tok or tok.type == 'SEMI':
+                break
+        parser.errok()
+
+        # Return SEMI to the parser as the next lookahead token
+        return tok
+
+
+ +

+Keep in mind that in the above error handling functions, ``parser`` is
+an instance of the parser created by ``yacc()``.  You'll need to save
+this instance someplace in your code so that you can refer to it during
+error handling.
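+
+In SLY, the equivalent logic lives in the parser's own ``error()``
+method, so the parser instance is simply ``self``.  A minimal sketch of
+the "discard and continue" style of handler shown above (assuming an
+``errok()`` helper with the semantics described here is available on the
+``Parser`` class)::
+
+    class CalcParser(Parser):
+        ...
+        def error(self, p):
+            if p:
+                print("Syntax error at token", p.type)
+                # Just discard the token and tell the parser it's okay.
+                self.errok()
+            else:
+                print("Syntax error at EOF")
+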

+ +

+Signalling an error from a production
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+
+If necessary, a production rule can manually force the parser to enter
+error recovery.  This is done by raising the ``SyntaxError`` exception
+like this::
+
+    def p_production(p):
+        'production : some production ...'
+        raise SyntaxError
+
+
+ +The effect of raising SyntaxError is the same as if the last symbol shifted onto the +parsing stack was actually a syntax error. Thus, when you do this, the last symbol shifted is popped off +of the parsing stack and the current lookahead token is set to an error token. The parser +then enters error-recovery mode where it tries to reduce rules that can accept error tokens. +The steps that follow from this point are exactly the same as if a syntax error were detected and +p_error() were called. + +

+One important aspect of manually setting an error is that the p_error() function will NOT be +called in this case. If you need to issue an error message, make sure you do it in the production that +raises SyntaxError. + +

+Note: This feature of PLY is meant to mimic the behavior of the YYERROR macro in yacc. + +

+When Do Syntax Errors Get Reported
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+ + +

+In most cases, yacc will handle errors as soon as a bad input token is
+detected on the input.  However, be aware that yacc may choose to
+delay error handling until after it has reduced one or more grammar
+rules first.  This behavior might be unexpected, but it's related to
+special states in the underlying parsing table known as "defaulted
+states."  A defaulted state is a parsing condition where the same
+grammar rule will be reduced regardless of what valid token
+comes next on the input.  For such states, yacc chooses to go ahead
+and reduce the grammar rule without reading the next input token.  If
+the next token is bad, yacc will eventually get around to reading it
+and report a syntax error.  It's just a little unusual in that you
+might see some of your grammar rules firing immediately prior to the
+syntax error.

+ +

+Usually, the delayed error reporting with defaulted states is harmless
+(and there are other reasons for wanting PLY to behave in this way).
+However, if you need to turn this behavior off for some reason, you
+can clear the defaulted states table like this::
+
+    parser = yacc.yacc()
+    parser.defaulted_states = {}
+
+
+ +

+Disabling defaulted states is not recommended if your grammar makes use
+of embedded actions as described in the "Embedded Actions" section below.

+ +

+General comments on error handling
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+ + +For normal types of languages, error recovery with error rules and resynchronization characters is probably the most reliable +technique. This is because you can instrument the grammar to catch errors at selected places where it is relatively easy +to recover and continue parsing. Panic mode recovery is really only useful in certain specialized applications where you might want +to discard huge portions of the input text to find a valid restart point. + +

+Line Number and Position Tracking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+
+Position tracking is often a tricky problem when writing compilers.
+By default, PLY tracks the line number and position of all tokens.
+This information is available using the following functions:
+
+- ``p.lineno(num)``.  Return the line number for symbol *num*.
+
+- ``p.lexpos(num)``.  Return the lexing position for symbol *num*.
+
+For example::
+
+    def p_expression(p):
+        'expression : expression PLUS expression'
+        line   = p.lineno(2)        # line number of the PLUS token
+        index  = p.lexpos(2)        # Position of the PLUS token
+
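+The production object that SLY passes to a rule method provides the same
+kind of access.  A minimal sketch (assuming the ``CalcParser`` grammar
+shown earlier, and that token positions are read with ``p.index()``
+rather than ``p.lexpos()``)::
+
+    @_('expr PLUS term')
+    def expr(self, p):
+        line = p.lineno(2)     # Line number of the PLUS token
+        pos  = p.index(2)      # Index of the PLUS token in the input text
+        p[0] = p[1] + p[3]
+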
+
+As an optional feature, yacc.py can automatically track line
+numbers and positions for all of the grammar symbols as well.
+However, this extra tracking requires extra processing and can
+significantly slow down parsing.  Therefore, it must be enabled by
+passing the ``tracking=True`` option to ``yacc.parse()``.  For example::
+
+    yacc.parse(data, tracking=True)
+
+
+
+Once enabled, the ``lineno()`` and ``lexpos()`` methods work
+for all grammar symbols.  In addition, two additional methods can be
+used:
+
+- ``p.linespan(num)``.  Return a tuple (startline, endline) with the
+  starting and ending line number for symbol *num*.
+
+- ``p.lexspan(num)``.  Return a tuple (start, end) with the starting
+  and ending lexing positions for symbol *num*.
+
+For example::
+
+    def p_expression(p):
+        'expression : expression PLUS expression'
+        p.lineno(1)        # Line number of the left expression
+        p.lineno(2)        # line number of the PLUS operator
+        p.lineno(3)        # line number of the right expression
+        ...
+        start, end = p.linespan(3)    # Start,end lines of the right expression
+        starti, endi = p.lexspan(3)   # Start,end positions of right expression
+
+ +Note: The lexspan() function only returns the range of values up to the start of the last grammar symbol. + +

+Although it may be convenient for PLY to track position information on +all grammar symbols, this is often unnecessary. For example, if you +are merely using line number information in an error message, you can +often just key off of a specific token in the grammar rule. For +example: + +

+
+def p_bad_func(p):
+    'funccall : fname LPAREN error RPAREN'
+    # Line number reported from LPAREN token
+    print("Bad function call at line", p.lineno(2))
+
+
+ +

+Similarly, you may get better parsing performance if you only +selectively propagate line number information where it's needed using +the p.set_lineno() method. For example: + +

+
+def p_fname(p):
+    'fname : ID'
+    p[0] = p[1]
+    p.set_lineno(0,p.lineno(1))
+
+
+ +PLY doesn't retain line number information from rules that have already been +parsed. If you are building an abstract syntax tree and need to have line numbers, +you should make sure that the line numbers appear in the tree itself. + +
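+For instance, in an SLY-style rule you might record the operator's line
+number directly in the node being built.  A small sketch (the tuple
+layout here is purely illustrative)::
+
+    @_('expr PLUS term')
+    def expr(self, p):
+        # Keep the line number of '+' inside the AST node itself
+        p[0] = ('binop', '+', p[1], p[3], p.lineno(2))
+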

+AST Construction
+^^^^^^^^^^^^^^^^

+ + +yacc.py provides no special functions for constructing an +abstract syntax tree. However, such construction is easy enough to do +on your own. + +

A minimal way to construct a tree is to simply create and +propagate a tuple or list in each grammar rule function. There +are many possible ways to do this, but one example would be something +like this: + +

+
+def p_expression_binop(p):
+    '''expression : expression PLUS expression
+                  | expression MINUS expression
+                  | expression TIMES expression
+                  | expression DIVIDE expression'''
+
+    p[0] = ('binary-expression',p[2],p[1],p[3])
+
+def p_expression_group(p):
+    'expression : LPAREN expression RPAREN'
+    p[0] = ('group-expression',p[2])
+
+def p_expression_number(p):
+    'expression : NUMBER'
+    p[0] = ('number-expression',p[1])
+
+
+ +

+Another approach is to create a set of data structures for different
+kinds of abstract syntax tree nodes and assign nodes to ``p[0]``
+in each rule. For example:
+

+
+class Expr: pass
+
+class BinOp(Expr):
+    def __init__(self,left,op,right):
+        self.type = "binop"
+        self.left = left
+        self.right = right
+        self.op = op
+
+class Number(Expr):
+    def __init__(self,value):
+        self.type = "number"
+        self.value = value
+
+def p_expression_binop(p):
+    '''expression : expression PLUS expression
+                  | expression MINUS expression
+                  | expression TIMES expression
+                  | expression DIVIDE expression'''
+
+    p[0] = BinOp(p[1],p[2],p[3])
+
+def p_expression_group(p):
+    'expression : LPAREN expression RPAREN'
+    p[0] = p[2]
+
+def p_expression_number(p):
+    'expression : NUMBER'
+    p[0] = Number(p[1])
+
+
+ +The advantage to this approach is that it may make it easier to attach more complicated +semantics, type checking, code generation, and other features to the node classes. + +
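+The same idea carries over directly to SLY.  A minimal sketch using the
+decorator style from earlier in this document (reusing the ``BinOp`` and
+``Number`` classes defined above)::
+
+    class CalcParser(Parser):
+        ...
+        @_('expr PLUS term',
+           'expr MINUS term')
+        def expr(self, p):
+            # p[2] is the operator token value ('+' or '-')
+            p[0] = BinOp(p[1], p[2], p[3])
+
+        @_('NUMBER')
+        def factor(self, p):
+            p[0] = Number(p[1])
+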

+To simplify tree traversal, it may make sense to pick a very generic +tree structure for your parse tree nodes. For example: + +

+
+class Node:
+    def __init__(self,type,children=None,leaf=None):
+         self.type = type
+         if children:
+              self.children = children
+         else:
+              self.children = [ ]
+         self.leaf = leaf
+	 
+def p_expression_binop(p):
+    '''expression : expression PLUS expression
+                  | expression MINUS expression
+                  | expression TIMES expression
+                  | expression DIVIDE expression'''
+
+    p[0] = Node("binop", [p[1],p[3]], p[2])
+
+
+ +

+Embedded Actions
+^^^^^^^^^^^^^^^^

+ + +The parsing technique used by yacc only allows actions to be executed at the end of a rule. For example, +suppose you have a rule like this: + +
+
+def p_foo(p):
+    "foo : A B C D"
+    print("Parsed a foo", p[1],p[2],p[3],p[4])
+
+
+ +

+In this case, the supplied action code only executes after all of the +symbols A, B, C, and D have been +parsed. Sometimes, however, it is useful to execute small code +fragments during intermediate stages of parsing. For example, suppose +you wanted to perform some action immediately after A has +been parsed. To do this, write an empty rule like this: + +

+
+def p_foo(p):
+    "foo : A seen_A B C D"
+    print("Parsed a foo", p[1],p[3],p[4],p[5])
+    print("seen_A returned", p[2])
+
+def p_seen_A(p):
+    "seen_A :"
+    print("Saw an A = ", p[-1])   # Access grammar symbol to left
+    p[0] = some_value            # Assign value to seen_A
+
+
+
+ +

+In this example, the empty ``seen_A`` rule executes immediately
+after ``A`` is shifted onto the parsing stack.  Within this
+rule, ``p[-1]`` refers to the symbol on the stack that appears
+immediately to the left of the ``seen_A`` symbol.  In this case,
+it would be the value of ``A`` in the ``foo`` rule
+immediately above.  Like other rules, a value can be returned from an
+embedded action by simply assigning it to ``p[0]``.
+
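+The same trick can be expressed with SLY's decorator syntax.  A rough
+sketch (assuming hypothetical ``A``, ``B``, ``C``, and ``D`` tokens, and
+that negative indices such as ``p[-1]`` reach into the parsing stack as
+described above)::
+
+    class MyParser(Parser):
+        ...
+        @_('A seen_A B C D')
+        def foo(self, p):
+            print("Parsed a foo", p[1], p[3], p[4], p[5])
+            print("seen_A returned", p[2])
+
+        @_('')
+        def seen_A(self, p):
+            print("Saw an A =", p[-1])   # Access grammar symbol to the left
+            p[0] = 'some_value'          # Assign a value to seen_A
+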

+The use of embedded actions can sometimes introduce extra shift/reduce conflicts. For example, +this grammar has no conflicts: + +

+
+def p_foo(p):
+    """foo : abcd
+           | abcx"""
+
+def p_abcd(p):
+    "abcd : A B C D"
+
+def p_abcx(p):
+    "abcx : A B C X"
+
+
+ +However, if you insert an embedded action into one of the rules like this, + +
+
+def p_foo(p):
+    """foo : abcd
+           | abcx"""
+
+def p_abcd(p):
+    "abcd : A B C D"
+
+def p_abcx(p):
+    "abcx : A B seen_AB C X"
+
+def p_seen_AB(p):
+    "seen_AB :"
+
+
+ +an extra shift-reduce conflict will be introduced. This conflict is +caused by the fact that the same symbol C appears next in +both the abcd and abcx rules. The parser can either +shift the symbol (abcd rule) or reduce the empty +rule seen_AB (abcx rule). + +

+A common use of embedded rules is to control other aspects of parsing
+such as scoping of local variables.  For example, if you were parsing C
+code, you might write code like this::
+
+    def p_statements_block(p):
+        "statements : LBRACE new_scope statements RBRACE"
+        # Action code
+        ...
+        pop_scope()        # Return to previous scope
+
+    def p_new_scope(p):
+        "new_scope :"
+        # Create a new scope for local variables
+        s = new_scope()
+        push_scope(s)
+        ...
+
+
+ +In this case, the embedded action new_scope executes +immediately after a LBRACE ({) symbol is parsed. +This might adjust internal symbol tables and other aspects of the +parser. Upon completion of the rule statements_block, code +might undo the operations performed in the embedded action +(e.g., pop_scope()). + +


+Multiple Parsers and Lexers
+---------------------------

+ + +In advanced parsing applications, you may want to have multiple +parsers and lexers. + +

+As a general rule this isn't a problem.  However, to make it work,
+you need to carefully make sure everything gets hooked up correctly.
+First, make sure you save the objects returned by ``lex()`` and
+``yacc()``.  For example:
+

+
+lexer  = lex.lex()       # Return lexer object
+parser = yacc.yacc()     # Return parser object
+
+
+ +Next, when parsing, make sure you give the parse() function a reference to the lexer it +should be using. For example: + +
+
+parser.parse(text,lexer=lexer)
+
+
+ +If you forget to do this, the parser will use the last lexer +created--which is not always what you want. + +

+Within lexer and parser rule functions, these objects are also +available. In the lexer, the "lexer" attribute of a token refers to +the lexer object that triggered the rule. For example: + +

+
+def t_NUMBER(t):
+   r'\d+'
+   ...
+   print(t.lexer)           # Show lexer object
+
+
+ +In the parser, the "lexer" and "parser" attributes refer to the lexer +and parser objects respectively. + +
+
+def p_expr_plus(p):
+   'expr : expr PLUS expr'
+   ...
+   print(p.parser)          # Show parser object
+   print(p.lexer)           # Show lexer object
+
+
+ +If necessary, arbitrary attributes can be attached to the lexer or parser object. +For example, if you wanted to have different parsing modes, you could attach a mode +attribute to the parser object and look at it later. + +
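+In SLY, the bookkeeping is simpler because lexers and parsers are just
+class instances, and ``parse()`` is handed the token stream explicitly.
+A small sketch (reusing the ``CalcLexer``/``CalcParser`` classes from the
+earlier example; ``text1`` and ``text2`` are input strings)::
+
+    lexer1, parser1 = CalcLexer(), CalcParser()
+    lexer2, parser2 = CalcLexer(), CalcParser()
+
+    # Each parser consumes tokens from whichever lexer you give it
+    result1 = parser1.parse(lexer1.tokenize(text1))
+    result2 = parser2.parse(lexer2.tokenize(text2))
+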

+Using Python's Optimized Mode
+-----------------------------

+ + +Because PLY uses information from doc-strings, parsing and lexing +information must be gathered while running the Python interpreter in +normal mode (i.e., not with the -O or -OO options). However, if you +specify optimized mode like this: + +
+
+lex.lex(optimize=1)
+yacc.yacc(optimize=1)
+
+
+ +then PLY can later be used when Python runs in optimized mode. To make this work, +make sure you first run Python in normal mode. Once the lexing and parsing tables +have been generated the first time, run Python in optimized mode. PLY will use +the tables without the need for doc strings. + +

+Beware: running PLY in optimized mode disables a lot of error +checking. You should only do this when your project has stabilized +and you don't need to do any debugging. One of the purposes of +optimized mode is to substantially decrease the startup time of +your compiler (by assuming that everything is already properly +specified and works). + +

+Advanced Debugging
+------------------

+ + +

+Debugging a compiler is typically not an easy task.  PLY provides some
+advanced diagnostic capabilities through the use of Python's
+``logging`` module.  The next two sections describe this.
+

+Debugging the lex() and yacc() commands
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+ + +

+Both the lex() and yacc() commands have a debugging +mode that can be enabled using the debug flag. For example: + +

+
+lex.lex(debug=True)
+yacc.yacc(debug=True)
+
+
+ +Normally, the output produced by debugging is routed to either +standard error or, in the case of yacc(), to a file +parser.out. This output can be more carefully controlled +by supplying a logging object. Here is an example that adds +information about where different debugging messages are coming from: + +
+
+# Set up a logging object
+import logging
+logging.basicConfig(
+    level = logging.DEBUG,
+    filename = "parselog.txt",
+    filemode = "w",
+    format = "%(filename)10s:%(lineno)4d:%(message)s"
+)
+log = logging.getLogger()
+
+lex.lex(debug=True,debuglog=log)
+yacc.yacc(debug=True,debuglog=log)
+
+
+ +If you supply a custom logger, the amount of debugging +information produced can be controlled by setting the logging level. +Typically, debugging messages are either issued at the DEBUG, +INFO, or WARNING levels. + +

+PLY's error messages and warnings are also produced using the logging +interface. This can be controlled by passing a logging object +using the errorlog parameter. + +

+
+lex.lex(errorlog=log)
+yacc.yacc(errorlog=log)
+
+
+ +If you want to completely silence warnings, you can either pass in a +logging object with an appropriate filter level or use the NullLogger +object defined in either lex or yacc. For example: + +
+
+yacc.yacc(errorlog=yacc.NullLogger())
+
+
+ +

+Run-time Debugging
+^^^^^^^^^^^^^^^^^^

+ + +

+To enable run-time debugging of a parser, use the debug option to parse. This +option can either be an integer (which simply turns debugging on or off) or an instance +of a logger object. For example: + +

+
+log = logging.getLogger()
+parser.parse(input,debug=log)
+
+
+ +If a logging object is passed, you can use its filtering level to control how much +output gets generated. The INFO level is used to produce information +about rule reductions. The DEBUG level will show information about the +parsing stack, token shifts, and other details. The ERROR level shows information +related to parsing errors. + +

+For very complicated problems, you should pass in a logging object that +redirects to a file where you can more easily inspect the output after +execution. + +

+Where to go from here?
+----------------------

+ + +The examples directory of the PLY distribution contains several simple examples. Please consult a +compilers textbook for the theory and underlying implementation details or LR parsing. + + + + + + + + + + diff --git a/example/calc/calc.py b/example/calc/calc.py index a4a2689..d8bd7e0 100644 --- a/example/calc/calc.py +++ b/example/calc/calc.py @@ -44,46 +44,45 @@ class CalcParser(Parser): @_('NAME "=" expression') def statement(self, p): - self.names[p[1]] = p[3] + self.names[p.NAME] = p.expression @_('expression') def statement(self, p): - print(p[1]) + print(p.expression) @_('expression "+" expression', 'expression "-" expression', 'expression "*" expression', 'expression "/" expression') def expression(self, p): - if p[2] == '+': - p[0] = p[1] + p[3] - elif p[2] == '-': - p[0] = p[1] - p[3] - elif p[2] == '*': - p[0] = p[1] * p[3] - elif p[2] == '/': - p[0] = p[1] / p[3] + if p[1] == '+': + return p.expression0 + p.expression1 + elif p[1] == '-': + return p.expression0 - p.expression1 + elif p[1] == '*': + return p.expression0 * p.expression1 + elif p[1] == '/': + return p.expression0 / p.expression1 @_('"-" expression %prec UMINUS') def expression(self, p): - p[0] = -p[2] + return -p.expression @_('"(" expression ")"') def expression(self, p): - p[0] = p[2] + return p.expression @_('NUMBER') def expression(self, p): - p[0] = p[1] + return p.NUMBER @_('NAME') def expression(self, p): try: - p[0] = self.names[p[1]] + return self.names[p.NAME] except LookupError: - print("Undefined name '%s'" % p[1]) - p[0] = 0 - + print("Undefined name '%s'" % p.NAME) + return 0 if __name__ == '__main__': lexer = CalcLexer() diff --git a/sly/lex.py b/sly/lex.py index acdb43f..a31dd80 100644 --- a/sly/lex.py +++ b/sly/lex.py @@ -68,9 +68,9 @@ class Token(object): def __repr__(self): return 'Token(%s, %r, %d, %d)' % (self.type, self.value, self.lineno, self.index) -class NoDupeDict(OrderedDict): +class LexerMetaDict(OrderedDict): ''' - Special dictionary that prohits duplicate definitions. + Special dictionary that prohits duplicate definitions in lexer specifications. 
''' def __setitem__(self, key, value): if key in self and not isinstance(value, property): @@ -83,17 +83,15 @@ class LexerMeta(type): ''' @classmethod def __prepare__(meta, *args, **kwargs): - d = NoDupeDict() - def _(*patterns): + d = LexerMetaDict() + def _(pattern, *extra): + patterns = [pattern, *extra] def decorate(func): - for pattern in patterns: - if hasattr(func, 'pattern'): - if isinstance(pattern, str): - func.pattern = ''.join(['(', pattern, ')|(', func.pattern, ')']) - else: - func.pattern = b''.join([b'(', pattern, b')|(', func.pattern, b')']) - else: - func.pattern = pattern + pattern = '|'.join('(%s)' % pat for pat in patterns ) + if hasattr(func, 'pattern'): + func.pattern = pattern + '|' + func.pattern + else: + func.pattern = pattern return func return decorate d['_'] = _ @@ -109,7 +107,7 @@ class Lexer(metaclass=LexerMeta): # These attributes may be defined in subclasses tokens = set() literals = set() - ignore = None + ignore = '' reflags = 0 # These attributes are constructed automatically by the associated metaclass @@ -118,7 +116,6 @@ class Lexer(metaclass=LexerMeta): _literals = set() _token_funcs = { } _ignored_tokens = set() - _input_type = str @classmethod def _collect_rules(cls, definitions): @@ -151,7 +148,7 @@ class Lexer(metaclass=LexerMeta): tokname = tokname[7:] cls._ignored_tokens.add(tokname) - if isinstance(value, (str, bytes)): + if isinstance(value, str): pattern = value elif callable(value): @@ -159,10 +156,7 @@ class Lexer(metaclass=LexerMeta): cls._token_funcs[tokname] = value # Form the regular expression component - if isinstance(pattern, str): - part = '(?P<%s>%s)' % (tokname, pattern) - else: - part = b'(?P<%s>%s)' % (tokname.encode('ascii'), pattern) + part = '(?P<%s>%s)' % (tokname, pattern) # Make sure the individual regex compiles properly try: @@ -171,38 +165,24 @@ class Lexer(metaclass=LexerMeta): raise PatternError('Invalid regex for token %s' % tokname) from e # Verify that the pattern doesn't match the empty string - if cpat.match(type(pattern)()): + if cpat.match(''): raise PatternError('Regex for token %s matches empty input' % tokname) parts.append(part) - # If no parts collected, then no rules to process if not parts: return - # Verify that all of the patterns are of the same type - if not all(type(part) == type(parts[0]) for part in parts): - raise LexerBuildError('Tokens are specified using both bytes and strings.') - # Form the master regular expression - if parts and isinstance(parts[0], bytes): - previous = (b'|' + cls._master_re.pattern) if cls._master_re else b'' - cls._master_re = re.compile(b'|'.join(parts) + previous, cls.reflags) - cls._input_type = bytes - else: - previous = ('|' + cls._master_re.pattern) if cls._master_re else '' - cls._master_re = re.compile('|'.join(parts) + previous, cls.reflags) - cls._input_type = str + previous = ('|' + cls._master_re.pattern) if cls._master_re else '' + cls._master_re = re.compile('|'.join(parts) + previous, cls.reflags) # Verify that that ignore and literals specifiers match the input type - if cls.ignore is not None and not isinstance(cls.ignore, cls._input_type): - raise LexerBuildError("ignore specifier type doesn't match token types (%s)" % - cls._input_type.__name__) + if not isinstance(cls.ignore, str): + raise LexerBuildError('ignore specifier must be a string') - if not all(isinstance(lit, cls._input_type) for lit in cls.literals): - raise LexerBuildError("literals specifier not using same type as tokens (%s)" % - cls._input_type.__name__) - + if not all(isinstance(lit, 
@@ -220,11 +200,6 @@ class Lexer(metaclass=LexerMeta):
                     index += 1
                     continue
             except IndexError:
-                if self.eof:
-                    text = self.eof()
-                    if text:
-                        index = 0
-                        continue
                 break
 
             tok = Token()
@@ -270,9 +245,6 @@ class Lexer(metaclass=LexerMeta):
             self.index = index
             self.lineno = lineno
 
-    # Default implementations of methods that may be subclassed by users
+    # Default implementations of the error handler. May be changed in subclasses
     def error(self, value):
         raise LexError("Illegal character %r at index %d" % (value[0], self.index), value)
-
-    def eof(self):
-        pass
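With ``eof()`` removed from the lexer, the caller is responsible for handing
``tokenize()`` a complete input string and iterating over the resulting token
stream; any chunked or interactive reading now happens before tokenizing. A
minimal sketch, reusing the hypothetical ``CalcLexer`` from the earlier
example::

    lexer = CalcLexer()
    for tok in lexer.tokenize('3 + 4 * 5'):
        print(tok.type, tok.value)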
diff --git a/sly/yacc.py b/sly/yacc.py
index 6a88580..61d8c30 100644
--- a/sly/yacc.py
+++ b/sly/yacc.py
@@ -33,7 +33,7 @@
 
 import sys
 import inspect
-from collections import OrderedDict
+from collections import OrderedDict, defaultdict
 
 __version__ = '0.0'
 __all__ = [ 'Parser' ]
@@ -104,31 +104,39 @@ class YaccSymbol:
 
 class YaccProduction:
     def __init__(self, s, stack=None):
-        self.slice = s
-        self.stack = stack
+        self._slice = s
+        self._stack = stack
+        self._namemap = { }
 
     def __getitem__(self, n):
-        if isinstance(n, slice):
-            return [s.value for s in self.slice[n]]
-        elif n >= 0:
-            return self.slice[n].value
+        if n >= 0:
+            return self._slice[n].value
         else:
-            return self.stack[n].value
+            return self._stack[n].value
 
     def __setitem__(self, n, v):
-        self.slice[n].value = v
+        self._slice[n].value = v
 
     def __len__(self):
-        return len(self.slice)
+        return len(self._slice)
 
     def lineno(self, n):
-        return getattr(self.slice[n], 'lineno', 0)
+        return getattr(self._slice[n], 'lineno', 0)
 
     def set_lineno(self, n, lineno):
-        self.slice[n].lineno = lineno
+        self._slice[n].lineno = lineno
 
     def index(self, n):
-        return getattr(self.slice[n], 'index', 0)
+        return getattr(self._slice[n], 'index', 0)
+
+    def __getattr__(self, name):
+        return self._slice[self._namemap[name]].value
+
+    def __setattr__(self, name, value):
+        if name[0:1] == '_' or name not in self._namemap:
+            super().__setattr__(name, value)
+        else:
+            self._slice[self._namemap[name]].value = value
 
 # -----------------------------------------------------------------------------
 #                          === Grammar Representation ===
@@ -171,17 +179,29 @@ class Production(object):
         self.file = file
         self.line = line
         self.prec = precedence
-
+
         # Internal settings used during table construction
-        self.len = len(self.prod)   # Length of the production
 
         # Create a list of unique production symbols used in the production
        self.usyms = []
-        for s in self.prod:
+        symmap = defaultdict(list)
+        for n, s in enumerate(self.prod):
+            symmap[s].append(n)
             if s not in self.usyms:
                 self.usyms.append(s)
 
+        # Create a dict mapping symbol names to indices
+        m = {}
+        for key, indices in symmap.items():
+            if len(indices) == 1:
+                m[key] = indices[0]
+            else:
+                for n, index in enumerate(indices):
+                    m[key+str(n)] = index
+
+        self.namemap = m
+
         # List of all LR items for the production
         self.lr_items = []
         self.lr_next = None
@@ -1512,9 +1532,10 @@ def _collect_grammar_rules(func):
         else:
             grammar.append((func, filename, lineno, prodname, syms))
         func = getattr(func, 'next_func', None)
+
     return grammar
 
-class OverloadDict(OrderedDict):
+class ParserMetaDict(OrderedDict):
     '''
     Dictionary that allows decorated grammar rule functions to be overloaded
     '''
@@ -1526,13 +1547,11 @@ class ParserMeta(type):
     @classmethod
     def __prepare__(meta, *args, **kwargs):
-        d = OverloadDict()
-        def _(*rules):
+        d = ParserMetaDict()
+        def _(rule, *extra):
+            rules = [rule, *extra]
             def decorate(func):
-                if hasattr(func, 'rules'):
-                    func.rules.extend(rules[::-1])
-                else:
-                    func.rules = list(rules[::-1])
+                func.rules = [ *getattr(func, 'rules', []), *rules[::-1] ]
                 return func
             return decorate
         d['_'] = _
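Taken together with the ``YaccProduction`` and ``Production.namemap`` changes
above, grammar rule functions can refer to right-hand-side symbols by name
(with repeated names apparently distinguished as ``expr0``, ``expr1``, and so
on, per the namemap construction). A rough sketch of the intended style,
reusing the hypothetical lexer above; note that, per the ``parse()`` changes
below, a rule's return value becomes the value of the left-hand-side symbol::

    from sly import Parser

    class CalcParser(Parser):
        # Hypothetical example -- not part of this patch
        tokens = CalcLexer.tokens

        @_('expr PLUS term')
        def expr(self, p):
            # p.expr and p.term are resolved through the production's namemap
            return p.expr + p.term

        @_('term')
        def expr(self, p):
            return p.term

        @_('term TIMES NUMBER')
        def term(self, p):
            return p.term * int(p.NUMBER)

        @_('NUMBER')
        def term(self, p):
            return int(p.NUMBER)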
@@ -1788,9 +1807,9 @@ class Parser(metaclass=ParserMeta):
         self.statestack.append(0)
         self.state = 0
 
-    def parse(self, lexer):
+    def parse(self, tokens):
         '''
-        Parse the given input text. lexer is a Lexer object that produces tokens
+        Parse the given input tokens.
         '''
         lookahead = None                         # Current lookahead symbol
         lookaheadstack = []                      # Stack of lookahead symbols
@@ -1801,10 +1820,6 @@ class Parser(metaclass=ParserMeta):
         pslice = YaccProduction(None)            # Production object passed to grammar rules
         errorcount = 0                           # Used during error recovery
 
-        # Save a local reference of the lexer being used
-        self.lexer = lexer
-        tokens = iter(self.lexer)
-
         # Set up the state and symbol stacks
         self.statestack = statestack = []        # Stack of parsing states
         self.symstack = symstack = []            # Stack of grammar symbols
@@ -1816,7 +1831,6 @@ class Parser(metaclass=ParserMeta):
             # Get the next symbol on the input.  If a lookahead symbol
             # is already set, we just use that. Otherwise, we'll pull
             # the next token off of the lookaheadstack or from the lexer
-
             if self.state not in defaulted_states:
                 if not lookahead:
                     if not lookaheadstack:
@@ -1852,74 +1866,22 @@ class Parser(metaclass=ParserMeta):
                    self.production = p = prod[-t]
                     pname = p.name
                     plen  = p.len
+                    pslice._namemap = p.namemap
 
                     # Call the production function
-                    sym = YaccSymbol()
-                    sym.type = pname       # Production name
-                    sym.value = None
-
+                    pslice._slice = symstack[-plen:] if plen else []
                     if plen:
-                        targ = symstack[-plen-1:]
-                        targ[0] = sym
+                        del symstack[-plen:]
+                        del statestack[-plen:]
 
-                        # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-                        # The code enclosed in this section is duplicated
-                        # below as a performance optimization.  Make sure
-                        # changes get made in both locations.
+                    sym = YaccSymbol()
+                    sym.type = pname
+                    sym.value = p.func(self, pslice)
+                    symstack.append(sym)
 
-                        pslice.slice = targ
-
-                        try:
-                            # Call the grammar rule with our special slice object
-                            del symstack[-plen:]
-                            p.func(self, pslice)
-                            del statestack[-plen:]
-                            symstack.append(sym)
-                            self.state = goto[statestack[-1]][pname]
-                            statestack.append(self.state)
-                        except SyntaxError:
-                            # If an error was set. Enter error recovery state
-                            lookaheadstack.append(lookahead)
-                            symstack.extend(targ[1:-1])
-                            statestack.pop()
-                            self.state = statestack[-1]
-                            sym.type = 'error'
-                            sym.value = 'error'
-                            lookahead = sym
-                            errorcount = ERROR_COUNT
-                            self.errorok = False
-                            continue
-                        # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-
-                    else:
-
-                        targ = [sym]
-
-                        # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-                        # The code enclosed in this section is duplicated
-                        # above as a performance optimization.  Make sure
-                        # changes get made in both locations.
-
-                        pslice.slice = targ
-
-                        try:
-                            # Call the grammar rule with our special slice object
-                            p.func(self, pslice)
-                            symstack.append(sym)
-                            self.state = goto[statestack[-1]][pname]
-                            statestack.append(self.state)
-                        except SyntaxError:
-                            # If an error was set. Enter error recovery state
-                            lookaheadstack.append(lookahead)
-                            statestack.pop()
-                            self.state = statestack[-1]
-                            sym.type = 'error'
-                            sym.value = 'error'
-                            lookahead = sym
-                            errorcount = ERROR_COUNT
-                            self.errorok = False
-                            continue
-                        # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+                    self.state = goto[statestack[-1]][pname]
+                    statestack.append(self.state)
+                    continue
 
                 if t == 0:
                     n = symstack[-1]
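Since ``parse()`` now accepts an iterable of tokens rather than a lexer
object, a typical driver wires the two pieces together explicitly, for example
by passing in the generator returned by ``tokenize()``. A sketch along those
lines, still using the hypothetical classes above and assuming that
``parse()`` returns the value of the start symbol::

    lexer = CalcLexer()
    parser = CalcParser()
    result = parser.parse(lexer.tokenize('3 + 4 * 5'))
    # With the sketch grammar above, result would be 23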