diff --git a/docs/sly.rst b/docs/sly.rst index 8d56f41..aa72daa 100644 --- a/docs/sly.rst +++ b/docs/sly.rst @@ -403,7 +403,11 @@ Writing a Parser The ``Parser`` class is used to parse language syntax. Before showing an example, there are a few important bits of background that must be -mentioned. First, *syntax* is usually specified in terms of a BNF +mentioned. + +Parsing Background +^^^^^^^^^^^^^^^^^^ +When writing a parser, *syntax* is usually specified in terms of a BNF grammar. For example, if you wanted to parse simple arithmetic expressions, you might first write an unambiguous grammar specification like this:: @@ -427,7 +431,7 @@ identifiers are known as *non-terminals*. The semantic behavior of a language is often specified using a technique known as syntax directed translation. In syntax directed -translation, attributes are attached to each symbol in a given grammar +translation, values are attached to each symbol in a given grammar rule along with an action. Whenever a particular grammar rule is recognized, the action describes what to do. For example, given the expression grammar above, you might write the specification for a @@ -456,17 +460,17 @@ values. SLY uses a parsing technique known as LR-parsing or shift-reduce parsing. LR parsing is a bottom up technique that tries to recognize the right-hand-side of various grammar rules. Whenever a valid -right-hand-side is found in the input, the appropriate action code is -triggered and the grammar symbols are replaced by the grammar symbol -on the left-hand-side. +right-hand-side is found in the input, the appropriate action method +is triggered and the grammar symbols on right hand side are replaced +by the grammar symbol on the left-hand-side. LR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next input token for patterns that match one of the grammar rules. The details of the algorithm can be found in a compiler textbook, but the following example illustrates -the steps that are performed if you wanted to parse the expression ``3 -+ 5 * (10 - 20)`` using the grammar defined above. In the example, -the special symbol ``$`` represents the end of input:: +the steps that are performed when parsing the expression ``3 + 5 * (10 +- 20)`` using the grammar defined above. In the example, the special +symbol ``$`` represents the end of input:: Step Symbol Stack Input Tokens Action ---- --------------------- --------------------- ------------------------------- @@ -519,9 +523,9 @@ rule ``expr : expr + term``. Parsing Example ^^^^^^^^^^^^^^^ -Suppose you wanted to make a grammar for simple arithmetic expressions as previously described. -Here is how you would do it with SLY:: - +Suppose you wanted to make a grammar for evaluating simple arithmetic +expressions as previously described. 
Here is how you would do it with +SLY:: from sly import Parser from calclex import CalcLexer @@ -533,35 +537,35 @@ Here is how you would do it with SLY:: # Grammar rules and actions @_('expr PLUS term') def expr(self, p): - p[0] = p[1] + p[3] + return p[0] + p[2] @_('expr MINUS term') def expr(self, p): - p[0] = p[1] - p[3] + return p[0] - p[2] @_('term') def expr(self, p): - p[0] = p[1] + return p[0] @_('term TIMES factor') def term(self, p): - p[0] = p[1] * p[3] + return p[0] * p[2] @_('term DIVIDE factor') def term(self, p): - p[0] = p[1] / p[3] + return p[0] / p[2] @_('factor') def term(self, p): - p[0] = p[1] + return p[0] @_('NUMBER') def factor(self, p): - p[0] = p[1] + return p[0] @_('LPAREN expr RPAREN') def factor(self, p): - p[0] = p[2] + return p[1] # Error rule for syntax errors def error(self, p): @@ -579,110 +583,98 @@ Here is how you would do it with SLY:: except EOFError: break -In this example, each grammar rule is defined by a Python function -where the docstring to that function contains the appropriate -context-free grammar specification. The statements that make up the -function body implement the semantic actions of the rule. Each function -accepts a single argument ``p`` that is a sequence containing the -values of each grammar symbol in the corresponding rule. The values -of ``p[i]`` are mapped to grammar symbols as shown here:: +In this example, each grammar rule is defined by a method that's been +decorated by ``@_(rule)`` decorator. The very first grammar rule +defines the top of the parse. The name of each method should match +the name of the grammar rule being parsed. The argument to the +``@_()`` decorator is a string describing the right-hand-side of the +grammar. Thus, a grammar rule like this:: - # p[1] p[2] p[3] + expr : expr PLUS term + +becomes a method like this:: + + @_('expr PLUS term') + def expr(self, p): + ... + +The method is triggered when that grammar rule is recognized on the +input. As an argument, the method receives a sequence of grammar symbol +values ``p`` that is accessed as an array. The mapping between +elements of ``p`` and the grammar rule is as shown here:: + + # p[0] p[1] p[2] # | | | @_('expr PLUS term') def expr(self, p): - p[0] = p[1] + p[3] + ... -For tokens, the "value" of the corresponding ``p[i]`` is the -*same* as the ``p.value`` attribute assigned in the lexer -module. For non-terminals, the value is determined by whatever is -placed in ``p[0]`` when rules are reduced. This value can be -anything at all. However, it probably most common for the value to be -a simple Python type, a tuple, or an instance. In this example, we -are relying on the fact that the ``NUMBER`` token stores an -integer value in its value field. All of the other rules simply -perform various types of integer operations and propagate the result. +For tokens, the value of the corresponding ``p[i]`` is the *same* as +the ``p.value`` attribute assigned to tokens in the lexer module. For +non-terminals, the value is whatever was returned by the methods +defined for that rule. -Note: The use of negative indices have a special meaning in -yacc---specially ``p[-1]`` does not have the same value -as ``p[3]`` in this example. Please see the section on "Embedded -Actions" for further details. +Within each rule, you return a value that becomes associated with that +grammar symbol elsewhere. 
In the example shown, rules are carrying out +the evaluation of an arithmetic expression:: -The first rule defined in the yacc specification determines the -starting grammar symbol (in this case, a rule for ``expr`` -appears first). Whenever the starting rule is reduced by the parser -and no more input is available, parsing stops and the final value is -returned (this value will be whatever the top-most rule placed -in ``p[0]``). Note: an alternative starting symbol can be -specified using the ``start`` attribute in the class. + @_('expr PLUS term') + def expr(self, p): + return p[0] + p[2] -The ``error()`` method is defined to catch syntax errors. +There are many other kinds of things that might happen in a rule +though. For example, a rule might construct part of a parse tree +instead:: + + @_('expr PLUS term') + def expr(self, p): + return ('+', p[0], p[2]) + +or perhaps create an instance related to an abstract syntax tree:: + + class BinOp(object): + def __init__(self, op, left, right): + self.op = op + self.left = left + self.right = right + + @_('expr PLUS term') + def expr(self, p): + return BinOp('+', p[0], p[2]) + +The key thing is that the method returns the value that's going to +be attached to the symbol "expr" in this case. + +The ``error()`` method is defined to handle syntax errors (if any). See the error handling section below for more detail. -If any errors are detected in your grammar specification, SLY will -produce diagnostic messages and possibly raise an exception. Some of -the errors that can be detected include: - -- Duplicated grammar rules -- Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars. -- Badly specified grammar rules. -- Infinite recursion (rules that can never terminate). -- Unused rules and tokens -- Undefined rules and tokens - -The final part of the example shows how to actually run the parser. -To run the parser, you simply have to call the ``parse()`` method with -a sequence of the input tokens. This will run all of the grammar -rules and return the result of the entire parse. This result return -is the value assigned to ``p[0]`` in the starting grammar rule. - Combining Grammar Rule Functions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -When grammar rules are similar, they can be combined into a single function. -For example, consider the two rules in our earlier example:: +When grammar rules are similar, they can be combined into a single method. +For example, suppose you had two rules that were constructing a parse tree:: @_('expr PLUS term') def expr(self, p): - p[0] = p[1] + p[3] + return ('+', p[0], p[2]) @_('expr MINUS term') def expr(self, p): - p[0] = p[1] - p[3] + return ('-', p[0], p[2]) -Instead of writing two functions, you might write a single function like this: +Instead of writing two functions, you might write a single function like this:: @_('expr PLUS term', 'expr MINUS term') def expr(self, p): - if p[2] == '+': - p[0] = p[1] + p[3] - elif p[2] == '-': - p[0] = p[1] - p[3] + return (p[1], p[0], p[2]) In general, the ``@_()`` decorator for any given method can list multiple grammar rules. When combining grammar rules into a single -function though, it is usually a good idea for all of the rules to have a -similar structure (e.g., the same number of terms). Otherwise, the -corresponding action code may be more complicated than necessary. -However, it is possible to handle simple cases using len(). 
For -example: - - @_('expr MINUS expr', - 'MINUS expression') - def expr(self, p): - if (len(p) == 4): - p[0] = p[1] - p[3] - elif (len(p) == 3): - p[0] = -p[2] - -If parsing performance is a concern, you should resist the urge to put -too much conditional processing into a single grammar rule as shown in -these examples. When you add checks to see which grammar rule is -being handled, you are actually duplicating the work that the parser -has already performed (i.e., the parser already knows exactly what rule it -matched). You can eliminate this overhead by using a -separate method for each grammar rule. +function though, it is usually a good idea for all of the rules to +have a similar structure (e.g., the same number of terms). Otherwise, +the corresponding action code may end up being more complicated than +necessary. Character Literals ^^^^^^^^^^^^^^^^^^ @@ -692,14 +684,16 @@ literals. For example:: @_('expr "+" term') def expr(self, p): - p[0] = p[1] + p[3] + return p[0] + p[2] @_('expr "-" term') def expr(self, p): - p[0] = p[1] - p[3] + return p[0] - p[2] -A character literal must be enclosed in quotes such as ``"+"``. In addition, if literals are used, they must be declared in the -corresponding lexer class through the use of a special ``literals`` declaration:: +A character literal must be enclosed in quotes such as ``"+"``. In +addition, if literals are used, they must be declared in the +corresponding lexer class through the use of a special ``literals`` +declaration:: class CalcLexer(Lexer): ... @@ -707,9 +701,8 @@ corresponding lexer class through the use of a special ``literals`` declaration: ... Character literals are limited to a single character. Thus, it is not -legal to specify literals such as ``<=`` or ``==``. -For this, use the normal lexing rules (e.g., define a rule such as -``EQ = r'=='``). +legal to specify literals such as ``<=`` or ``==``. For this, use the +normal lexing rules (e.g., define a rule such as ``EQ = r'=='``). Empty Productions ^^^^^^^^^^^^^^^^^ @@ -720,7 +713,7 @@ If you need an empty production, define a special rule like this:: def empty(self, p): pass -Now to use the empty production, simply use 'empty' as a symbol. For +Now to use the empty production elsewhere, use the name 'empty' as a symbol. For example:: @_('item') @@ -732,16 +725,16 @@ example:: ... Note: You can write empty rules anywhere by simply specifying an empty -string. However, I personally find that writing an "empty" -rule and using "empty" to denote an empty production is easier to read -and more clearly states your intentions. +string. However,writing an "empty" rule and using "empty" to denote an +empty production may be easier to read and more clearly state your +intention. Changing the starting symbol ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Normally, the first rule found in a yacc specification defines the -starting grammar rule (top level rule). To change this, supply -a ``start`` specifier in your file. For example:: +Normally, the first rule found in a parser class defines the starting +grammar rule (top level rule). To change this, supply a ``start`` +specifier in your class. For example:: class CalcParser(Parser): start = 'foo' @@ -766,12 +759,12 @@ situations, it is extremely difficult or awkward to write grammars in this format. 
A much more natural way to express the grammar is in a more compact form like this:: -expr : expr PLUS expr - | expr MINUS expr - | expr TIMES expr - | expr DIVIDE expr - | LPAREN expr RPAREN - | NUMBER + expr : expr PLUS expr + | expr MINUS expr + | expr TIMES expr + | expr DIVIDE expr + | LPAREN expr RPAREN + | NUMBER Unfortunately, this grammar specification is ambiguous. For example, if you are parsing the string "3 * 4 + 5", there is no way to tell how @@ -801,17 +794,17 @@ perfectly legal from the rules of the context-free-grammar. By default, all shift/reduce conflicts are resolved in favor of shifting. Therefore, in the above example, the parser will always -shift the ``+`` instead of reducing. Although this strategy -works in many cases (for example, the case of -"if-then" versus "if-then-else"), it is not enough for arithmetic expressions. In fact, -in the above example, the decision to shift ``+`` is completely -wrong---we should have reduced ``expr * expr`` since -multiplication has higher mathematical precedence than addition. +shift the ``+`` instead of reducing. Although this strategy works in +many cases (for example, the case of "if-then" versus "if-then-else"), +it is not enough for arithmetic expressions. In fact, in the above +example, the decision to shift ``+`` is completely wrong---we should +have reduced ``expr * expr`` since multiplication has higher +mathematical precedence than addition. -To resolve ambiguity, especially in expression -grammars, SLY allows individual tokens to be assigned a -precedence level and associativity. This is done by adding a variable -``precedence`` to the grammar file like this:: +To resolve ambiguity, especially in expression grammars, SLY allows +individual tokens to be assigned a precedence level and associativity. +This is done by adding a variable ``precedence`` to the parser class +like this:: class CalcParser(Parser): ... @@ -865,12 +858,12 @@ rule is reduced for left associativity, whereas the token is shifted for right a 4. If nothing is known about the precedence, shift/reduce conflicts are resolved in favor of shifting (the default). -For example, if "expr PLUS expr" has been parsed and the -next token is "TIMES", the action is going to be a shift because -"TIMES" has a higher precedence level than "PLUS". On the other hand, -if "expr TIMES expr" has been parsed and the next token is -"PLUS", the action is going to be reduce because "PLUS" has a lower -precedence than "TIMES." +For example, if ``expr PLUS expr`` has been parsed and the +next token is ``TIMES``, the action is going to be a shift because +``TIMES`` has a higher precedence level than ``PLUS``. On the other hand, +if ``expr TIMES expr`` has been parsed and the next token is +``PLUS``, the action is going to be reduce because ``PLUS`` has a lower +precedence than ``TIMES.`` When shift/reduce conflicts are resolved using the first three techniques (with the help of precedence rules), SLY will @@ -878,10 +871,10 @@ report no errors or conflicts in the grammar. One problem with the precedence specifier technique is that it is sometimes necessary to change the precedence of an operator in certain -contexts. For example, consider a unary-minus operator in "3 + 4 * --5". Mathematically, the unary minus is normally given a very high +contexts. For example, consider a unary-minus operator in ``3 + 4 * +-5``. Mathematically, the unary minus is normally given a very high precedence--being evaluated before the multiply. 
However, in our
-precedence specifier, MINUS has a lower precedence than TIMES. To
+precedence specifier, ``MINUS`` has a lower precedence than ``TIMES``. To
 deal with this, precedence rules can be given for so-called "fictitious tokens"
 like this::
@@ -893,19 +886,19 @@ like this::
         ('right', 'UMINUS'),            # Unary minus operator
     )
 
-Now, in the grammar file, we can write our unary minus rule like this::
+Now, in the grammar file, you write the unary minus rule like this::
 
-    @_('MINUS expr %prec UMINUS')
-    def expr(p):
-        p[0] = -p[2]
+    @_('MINUS expr %prec UMINUS')
+    def expr(self, p):
+        return -p[1]
 
 In this case, ``%prec UMINUS`` overrides the default rule precedence--setting it to that
-of UMINUS in the precedence specifier.
+of ``UMINUS`` in the precedence specifier.
 
-At first, the use of UMINUS in this example may appear very confusing.
-UMINUS is not an input token or a grammar rule. Instead, you should
+At first, the use of ``UMINUS`` in this example may appear very confusing.
+``UMINUS`` is not an input token or a grammar rule. Instead, you should
 think of it as the name of a special marker in the precedence table.
-When you use the ``%prec`` qualifier, you're simply telling SLY
+When you use the ``%prec`` qualifier, you're telling SLY
 that you want the precedence of the expression to be the same as for
 this special marker instead of the usual precedence.
@@ -934,7 +927,6 @@ rule that appears first in the grammar file.
 Reduce/reduce conflicts are almost always caused when different sets
 of grammar rules somehow generate the same set of symbols. For
 example::
-
     assignment : ID EQUALS NUMBER
                | ID EQUALS expr
@@ -950,7 +942,7 @@ In this case, a reduce/reduce conflict exists between these two rules::
 
     assignment  : ID EQUALS NUMBER
     expr        : NUMBER
 
-For example, if you wrote "a = 5", the parser can't figure out if this
+For example, if you're parsing ``a = 5``, the parser can't figure out if this
 is supposed to be reduced as ``assignment : ID EQUALS NUMBER`` or
 whether it's supposed to reduce the 5 as an expression and then reduce
 the rule ``assignment : ID EQUALS expr``.
@@ -975,1200 +967,401 @@ Parser Debugging
 Tracking down shift/reduce and reduce/reduce conflicts is one of the
 finer pleasures of using an LR parsing algorithm. To assist in
-debugging, SLY creates a debugging file called 'parser.out' when it
-generates the parsing table. The contents of this file look like the
-following:
-- -The different states that appear in this file are a representation of -every possible sequence of valid input tokens allowed by the grammar. -When receiving input tokens, the parser is building up a stack and -looking for matching rules. Each state keeps track of the grammar -rules that might be in the process of being matched at that point. Within each -rule, the "." character indicates the current location of the parse -within that rule. In addition, the actions for each valid input token -are listed. When a shift/reduce or reduce/reduce conflict arises, -rules not selected are prefixed with an !. For example: - --Unused terminals: - - -Grammar - -Rule 1 expression -> expression PLUS expression -Rule 2 expression -> expression MINUS expression -Rule 3 expression -> expression TIMES expression -Rule 4 expression -> expression DIVIDE expression -Rule 5 expression -> NUMBER -Rule 6 expression -> LPAREN expression RPAREN - -Terminals, with rules where they appear - -TIMES : 3 -error : -MINUS : 2 -RPAREN : 6 -LPAREN : 6 -DIVIDE : 4 -PLUS : 1 -NUMBER : 5 - -Nonterminals, with rules where they appear - -expression : 1 1 2 2 3 3 4 4 6 0 - - -Parsing method: LALR - - -state 0 - - S' -> . expression - expression -> . expression PLUS expression - expression -> . expression MINUS expression - expression -> . expression TIMES expression - expression -> . expression DIVIDE expression - expression -> . NUMBER - expression -> . LPAREN expression RPAREN - - NUMBER shift and go to state 3 - LPAREN shift and go to state 2 - - -state 1 - - S' -> expression . - expression -> expression . PLUS expression - expression -> expression . MINUS expression - expression -> expression . TIMES expression - expression -> expression . DIVIDE expression - - PLUS shift and go to state 6 - MINUS shift and go to state 5 - TIMES shift and go to state 4 - DIVIDE shift and go to state 7 - - -state 2 - - expression -> LPAREN . expression RPAREN - expression -> . expression PLUS expression - expression -> . expression MINUS expression - expression -> . expression TIMES expression - expression -> . expression DIVIDE expression - expression -> . NUMBER - expression -> . LPAREN expression RPAREN - - NUMBER shift and go to state 3 - LPAREN shift and go to state 2 - - -state 3 - - expression -> NUMBER . - - $ reduce using rule 5 - PLUS reduce using rule 5 - MINUS reduce using rule 5 - TIMES reduce using rule 5 - DIVIDE reduce using rule 5 - RPAREN reduce using rule 5 - - -state 4 - - expression -> expression TIMES . expression - expression -> . expression PLUS expression - expression -> . expression MINUS expression - expression -> . expression TIMES expression - expression -> . expression DIVIDE expression - expression -> . NUMBER - expression -> . LPAREN expression RPAREN - - NUMBER shift and go to state 3 - LPAREN shift and go to state 2 - - -state 5 - - expression -> expression MINUS . expression - expression -> . expression PLUS expression - expression -> . expression MINUS expression - expression -> . expression TIMES expression - expression -> . expression DIVIDE expression - expression -> . NUMBER - expression -> . LPAREN expression RPAREN - - NUMBER shift and go to state 3 - LPAREN shift and go to state 2 - - -state 6 - - expression -> expression PLUS . expression - expression -> . expression PLUS expression - expression -> . expression MINUS expression - expression -> . expression TIMES expression - expression -> . expression DIVIDE expression - expression -> . NUMBER - expression -> . 
LPAREN expression RPAREN - - NUMBER shift and go to state 3 - LPAREN shift and go to state 2 - - -state 7 - - expression -> expression DIVIDE . expression - expression -> . expression PLUS expression - expression -> . expression MINUS expression - expression -> . expression TIMES expression - expression -> . expression DIVIDE expression - expression -> . NUMBER - expression -> . LPAREN expression RPAREN - - NUMBER shift and go to state 3 - LPAREN shift and go to state 2 - - -state 8 - - expression -> LPAREN expression . RPAREN - expression -> expression . PLUS expression - expression -> expression . MINUS expression - expression -> expression . TIMES expression - expression -> expression . DIVIDE expression - - RPAREN shift and go to state 13 - PLUS shift and go to state 6 - MINUS shift and go to state 5 - TIMES shift and go to state 4 - DIVIDE shift and go to state 7 - - -state 9 - - expression -> expression TIMES expression . - expression -> expression . PLUS expression - expression -> expression . MINUS expression - expression -> expression . TIMES expression - expression -> expression . DIVIDE expression - - $ reduce using rule 3 - PLUS reduce using rule 3 - MINUS reduce using rule 3 - TIMES reduce using rule 3 - DIVIDE reduce using rule 3 - RPAREN reduce using rule 3 - - ! PLUS [ shift and go to state 6 ] - ! MINUS [ shift and go to state 5 ] - ! TIMES [ shift and go to state 4 ] - ! DIVIDE [ shift and go to state 7 ] - -state 10 - - expression -> expression MINUS expression . - expression -> expression . PLUS expression - expression -> expression . MINUS expression - expression -> expression . TIMES expression - expression -> expression . DIVIDE expression - - $ reduce using rule 2 - PLUS reduce using rule 2 - MINUS reduce using rule 2 - RPAREN reduce using rule 2 - TIMES shift and go to state 4 - DIVIDE shift and go to state 7 - - ! TIMES [ reduce using rule 2 ] - ! DIVIDE [ reduce using rule 2 ] - ! PLUS [ shift and go to state 6 ] - ! MINUS [ shift and go to state 5 ] - -state 11 - - expression -> expression PLUS expression . - expression -> expression . PLUS expression - expression -> expression . MINUS expression - expression -> expression . TIMES expression - expression -> expression . DIVIDE expression - - $ reduce using rule 1 - PLUS reduce using rule 1 - MINUS reduce using rule 1 - RPAREN reduce using rule 1 - TIMES shift and go to state 4 - DIVIDE shift and go to state 7 - - ! TIMES [ reduce using rule 1 ] - ! DIVIDE [ reduce using rule 1 ] - ! PLUS [ shift and go to state 6 ] - ! MINUS [ shift and go to state 5 ] - -state 12 - - expression -> expression DIVIDE expression . - expression -> expression . PLUS expression - expression -> expression . MINUS expression - expression -> expression . TIMES expression - expression -> expression . DIVIDE expression - - $ reduce using rule 4 - PLUS reduce using rule 4 - MINUS reduce using rule 4 - TIMES reduce using rule 4 - DIVIDE reduce using rule 4 - RPAREN reduce using rule 4 - - ! PLUS [ shift and go to state 6 ] - ! MINUS [ shift and go to state 5 ] - ! TIMES [ shift and go to state 4 ] - ! DIVIDE [ shift and go to state 7 ] - -state 13 - - expression -> LPAREN expression RPAREN . - - $ reduce using rule 6 - PLUS reduce using rule 6 - MINUS reduce using rule 6 - TIMES reduce using rule 6 - DIVIDE reduce using rule 6 - RPAREN reduce using rule 6 --
-By looking at these rules (and with a little practice), you can usually track down the source
-of most parsing conflicts. It should also be stressed that not all shift-reduce conflicts are
-bad. However, the only way to be sure that they are resolved correctly is to look at parser.out.
+debugging, you can have SLY produce a debugging file when it
+constructs the parsing tables. Add a ``debugfile`` attribute to your
+class like this::
+
+    class CalcParser(Parser):
+        debugfile = 'parser.out'
+        ...
+
+When present, this will write the entire grammar along with all parsing
+states to the file you specify. Each state of the parser is shown
+as output that looks something like this::
+
+    state 2
+
+        (7) factor -> LPAREN . expr RPAREN
+        (1) expr -> . term
+        (2) expr -> . expr MINUS term
+        (3) expr -> . expr PLUS term
+        (4) term -> . factor
+        (5) term -> . term DIVIDE factor
+        (6) term -> . term TIMES factor
+        (7) factor -> . LPAREN expr RPAREN
+        (8) factor -> . NUMBER
+
+        LPAREN          shift and go to state 2
+        NUMBER          shift and go to state 3
+
+        factor          shift and go to state 1
+        term            shift and go to state 4
+        expr            shift and go to state 6
+
+Each state keeps track of the grammar rules that might be in the
+process of being matched at that point. Within each rule, the "."
+character indicates the current location of the parse within that
+rule. In addition, the actions for each valid input token are listed.
+By looking at these rules (and with a little practice), you can
+usually track down the source of most parsing conflicts. It should
+also be stressed that not all shift-reduce conflicts are bad.
+However, the only way to be sure that they are resolved correctly is
+to look at the debugging file.
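+Note that the file is rewritten whenever the parsing tables are
+built, which happens at class-definition time. Merely importing or
+running the module that defines the parser refreshes it; a small
+sketch, with ``calcparse`` as a hypothetical module containing the
+``CalcParser`` class above::
+
+    import calcparse        # Defining CalcParser writes 'parser.out'
+
+    with open('parser.out') as f:
+        print(f.read())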
+Syntax Error Handling
+^^^^^^^^^^^^^^^^^^^^^
+
-When a syntax error occurs, yacc.py performs the following steps:
+When a syntax error occurs, SLY performs the following steps:
+
+1. On the first occurrence of a syntax error, the user-defined
+   ``error()`` method is called with the offending token as an
+   argument. If the syntax error is due to reaching the end-of-file,
+   ``None`` is passed instead. Afterwards, the parser enters an
+   "error-recovery" mode in which it will not make further calls to
+   ``error()`` until it has successfully shifted at least three tokens
+   onto the parsing stack.
+
+2. If no recovery action is taken in ``error()``, the offending
+   lookahead token is replaced with a special ``error`` token.
+
+3. If the offending lookahead token is already set to ``error``, the
+   top item of the parsing stack is deleted.
+
+4. If the entire parsing stack is unwound, the parser enters restart
+   state and attempts to start parsing from its initial state.
+
+5. If a grammar rule accepts ``error`` as a token, it will be shifted
+   onto the parsing stack.
+
+6. If the top item of the parsing stack is ``error``, lookahead tokens
+   will be discarded until the parser can successfully shift a new
+   symbol or reduce a rule involving ``error``.
+
+Recovery and resynchronization with error rules
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The most well-behaved approach for handling syntax errors is to write
+grammar rules that include the ``error`` token. For example, suppose
+your language had a grammar rule for a print statement like this::
+
+    @_('PRINT expr SEMI')
+    def statement(self, p):
+        ...
+
-To account for the possibility of a bad expression, you might write an additional grammar rule like this:
+To account for the possibility of a bad expression, you might write an
+additional grammar rule like this::
+
+    @_('PRINT error SEMI')
+    def statement(self, p):
+        print("Syntax error in print statement. Bad expression")
+
-def p_statement_print(p):
-    'statement : PRINT expr SEMI'
-    ...
-
-def p_statement_print_error(p):
-    'statement : PRINT error SEMI'
-    print("Syntax error in print statement. Bad expression")
-In this case, the error token will match any sequence of
+In this case, the ``error`` token will match any sequence of
tokens that might appear up to the first semicolon that is
encountered. Once the semicolon is reached, the rule will be
-invoked and the error token will go away.
+invoked and the ``error`` token will go away.
-This type of recovery is sometimes known as parser resynchronization. -The error token acts as a wildcard for any bad input text and -the token immediately following error acts as a +The ``error`` token acts as a wildcard for any bad input text and +the token immediately following ``error`` acts as a synchronization token. -
-It is important to note that the error token usually does not appear as the last token -on the right in an error rule. For example: +It is important to note that the ``error`` token usually does not +appear as the last token on the right in an error rule. For example:: -
-
-def p_statement_print_error(p):
- 'statement : PRINT error'
- print("Syntax error in print statement. Bad expression")
-
-
+ @_('PRINT error')
+ def statement(self, p):
+ print("Syntax error in print statement. Bad expression")
This is because the first bad token encountered will cause the rule to
be reduced--which may make it difficult to recover if more bad tokens
immediately follow.
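+To see resynchronization in action, here is a minimal, self-contained
+sketch built around such an error rule. The token names and grammar
+here are illustrative only -- they are not the calculator developed
+earlier::
+
+    from sly import Lexer, Parser
+
+    class StmtLexer(Lexer):
+        tokens = { PRINT, NUMBER, PLUS, SEMI }
+        ignore = ' \t'
+        PRINT = r'print'
+        NUMBER = r'\d+'
+        PLUS = r'\+'
+        SEMI = r';'
+
+    class StmtParser(Parser):
+        tokens = StmtLexer.tokens
+
+        @_('statements statement', 'statement')
+        def statements(self, p):
+            pass
+
+        @_('PRINT expr SEMI')
+        def statement(self, p):
+            print('print:', p.expr)
+
+        @_('PRINT error SEMI')
+        def statement(self, p):
+            print('Syntax error in print statement. Bad expression')
+
+        @_('expr PLUS NUMBER')
+        def expr(self, p):
+            return p.expr + int(p.NUMBER)
+
+        @_('NUMBER')
+        def expr(self, p):
+            return int(p.NUMBER)
+
+        def error(self, p):
+            pass    # Recovery is handled by the 'PRINT error SEMI' rule
+
+    lexer = StmtLexer()
+    parser = StmtParser()
+    parser.parse(lexer.tokenize('print 1 + 2 ; print 3 3 ; print 4 ;'))
+
+The bad middle statement is reported, and the statements on either
+side of it are still processed normally.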
-Panic mode recovery is implemented entirely in the p_error() function. For example, this
-function starts discarding tokens until it reaches a closing '}'. Then, it restarts the
-parser in its initial state.
+Panic Mode Recovery
+~~~~~~~~~~~~~~~~~~~
+
+An alternative error recovery scheme is to enter a panic-mode recovery
+in which tokens are discarded to a point where the parser might be
+able to recover in some sensible manner.
+
+Panic mode recovery is implemented entirely in the ``error()`` method.
+For example, this method starts discarding tokens until it reaches a
+closing '}'. Then, it restarts the parser in its initial state::
+
+    def error(self, p):
+        print("Whoa. You are seriously hosed.")
+        if not p:
+            print("End of File!")
+            return
-
-def p_error(p):
- print("Whoa. You are seriously hosed.")
- if not p:
- print("End of File!")
- return
+ # Read ahead looking for a closing '}'
+ while True:
+ tok = next(self.tokens, None)
+ if not tok or tok.type == 'RBRACE':
+ break
+ self.restart()
- # Read ahead looking for a closing '}'
- while True:
- tok = parser.token() # Get the next token
- if not tok or tok.type == 'RBRACE':
- break
- parser.restart()
-
-
+This function simply discards the bad token and tells the parser that
+the error was ok::
--This function simply discards the bad token and tells the parser that the error was ok. + def error(self, p): + if p: + print("Syntax error at token", p.type) + # Just discard the token and tell the parser it's okay. + self.errok() + else: + print("Syntax error at EOF") -
-
-def p_error(p):
- if p:
- print("Syntax error at token", p.type)
- # Just discard the token and tell the parser it's okay.
- parser.errok()
- else:
- print("Syntax error at EOF")
-
-
+A few additional details about some of the attributes and methods being used:
-More information on these methods is as follows:
+
+- ``self.errok()``. This resets the parser state so it doesn't think
+  it's in error-recovery mode. This will prevent an ``error`` token
+  from being generated and will reset the internal error counters so
+  that the next syntax error will call ``error()`` again.
+
+- ``self.restart()``. This discards the entire parsing stack and
+  resets the parser to its initial state.
 
-To supply the next lookahead token to the parser, p_error() can return a token. This might be
-useful if trying to synchronize on special characters. For example:
-
-def p_error(p):
-    # Read ahead looking for a terminating ";"
-    while True:
-        tok = parser.token()       # Get the next token
-        if not tok or tok.type == 'SEMI': break
-    parser.errok()
-
-    # Return SEMI to the parser as the next lookahead token
-    return tok
+To supply the next lookahead token to the parser, ``error()`` can
+return a token. This might be useful if trying to synchronize on
+special characters. For example::
+
+    def error(self, tok):
+        # Read ahead looking for a terminating ";"
+        while True:
+            tok = next(self.tokens, None)    # Get the next token
+            if not tok or tok.type == 'SEMI':
+                break
+        self.errok()
+
+        # Return SEMI to the parser as the next lookahead token
+        return tok
 
-Keep in mind in that the above error handling functions,
-parser is an instance of the parser created by
-yacc(). You'll need to save this instance someplace in your
-code so that you can refer to it during error handling.
+
+When Do Syntax Errors Get Reported?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In most cases, SLY will handle errors as soon as a bad input token is +detected on the input. However, be aware that SLY may choose to delay +error handling until after it has reduced one or more grammar rules +first. This behavior might be unexpected, but it's related to special +states in the underlying parsing table known as "defaulted states." A +defaulted state is parsing condition where the same grammar rule will +be reduced regardless of what valid token comes next on the input. +For such states, SLY chooses to go ahead and reduce the grammar rule +*without reading the next input token*. If the next token is bad, SLY +will eventually get around to reading it and report a syntax error. +It's just a little unusual in that you might see some of your grammar +rules firing immediately prior to the syntax error. --- -The effect of raising SyntaxError is the same as if the last symbol shifted onto the -parsing stack was actually a syntax error. Thus, when you do this, the last symbol shifted is popped off -of the parsing stack and the current lookahead token is set to an error token. The parser -then enters error-recovery mode where it tries to reduce rules that can accept error tokens. -The steps that follow from this point are exactly the same as if a syntax error were detected and -p_error() were called. - --def p_production(p): - 'production : some production ...' - raise SyntaxError --
-One important aspect of manually setting an error is that the p_error() function will NOT be -called in this case. If you need to issue an error message, make sure you do it in the production that -raises SyntaxError. - -
-Note: This feature of PLY is meant to mimic the behavior of the YYERROR macro in yacc. - -
-In most cases, yacc will handle errors as soon as a bad input token is -detected on the input. However, be aware that yacc may choose to -delay error handling until after it has reduced one or more grammar -rules first. This behavior might be unexpected, but it's related to -special states in the underlying parsing table known as "defaulted -states." A defaulted state is parsing condition where the same -grammar rule will be reduced regardless of what valid token -comes next on the input. For such states, yacc chooses to go ahead -and reduce the grammar rule without reading the next input -token. If the next token is bad, yacc will eventually get around to reading it and -report a syntax error. It's just a little unusual in that you might -see some of your grammar rules firing immediately prior to the syntax -error. -
- --Usually, the delayed error reporting with defaulted states is harmless -(and there are other reasons for wanting PLY to behave in this way). -However, if you need to turn this behavior off for some reason. You -can clear the defaulted states table like this: -
- -
-
-parser = yacc.yacc()
-parser.defaulted_states = {}
-
-
-
--Disabling defaulted states is not recommended if your grammar makes use -of embedded actions as described in Section 6.11.
- --- -As an optional feature, yacc.py can automatically track line -numbers and positions for all of the grammar symbols as well. -However, this extra tracking requires extra processing and can -significantly slow down parsing. Therefore, it must be enabled by -passing the -tracking=True option to yacc.parse(). For example: - --def p_expression(p): - 'expression : expression PLUS expression' - line = p.lineno(2) # line number of the PLUS token - index = p.lexpos(2) # Position of the PLUS token --
-- -Once enabled, the lineno() and lexpos() methods work -for all grammar symbols. In addition, two additional methods can be -used: - --yacc.parse(data,tracking=True) --
-- -Note: The lexspan() function only returns the range of values up to the start of the last grammar symbol. - --def p_expression(p): - 'expression : expression PLUS expression' - p.lineno(1) # Line number of the left expression - p.lineno(2) # line number of the PLUS operator - p.lineno(3) # line number of the right expression - ... - start,end = p.linespan(3) # Start,end lines of the right expression - starti,endi = p.lexspan(3) # Start,end positions of right expression - --
-Although it may be convenient for PLY to track position information on -all grammar symbols, this is often unnecessary. For example, if you -are merely using line number information in an error message, you can -often just key off of a specific token in the grammar rule. For -example: - -
-
-def p_bad_func(p):
- 'funccall : fname LPAREN error RPAREN'
- # Line number reported from LPAREN token
- print("Bad function call at line", p.lineno(2))
-
-
-
--Similarly, you may get better parsing performance if you only -selectively propagate line number information where it's needed using -the p.set_lineno() method. For example: - -
-- -PLY doesn't retain line number information from rules that have already been -parsed. If you are building an abstract syntax tree and need to have line numbers, -you should make sure that the line numbers appear in the tree itself. - --def p_fname(p): - 'fname : ID' - p[0] = p[1] - p.set_lineno(0,p.lineno(1)) --
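+Line Number and Position Tracking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The old position-tracking calls above have no direct SLY equivalent.
+What a production can consult instead is the position of its left-most
+terminal; a sketch, assuming SLY's ``p.lineno`` and ``p.index``
+attributes (they are not otherwise shown in this diff)::
+
+    @_('expr PLUS expr')
+    def expr(self, p):
+        line = p.lineno      # line number of the PLUS token
+        index = p.index      # lexing index of the PLUS token
+        ...
+
+SLY does not propagate position information to non-terminals. If you
+need line numbers in later passes, record them in the values you
+return -- for example, by storing ``p.lineno`` in your AST nodes.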
-A minimal way to construct a tree is to simply create and
+AST Construction
+^^^^^^^^^^^^^^^^
+
+SLY provides no special functions for constructing an abstract syntax
+tree. However, such construction is easy enough to do on your own.
+
+A minimal way to construct a tree is to simply create and
 propagate a tuple or list in each grammar rule function. There are
 many possible ways to do this, but one example would be something
-like this:
+like this::
-
-def p_expression_binop(p):
- '''expression : expression PLUS expression
- | expression MINUS expression
- | expression TIMES expression
- | expression DIVIDE expression'''
+ @_('expr PLUS expr',
+ 'expr MINUS expr',
+ 'expr TIMES expr',
+ 'expr DIVIDE expr')
+ def expr(self, p):
+ return ('binary-expression', p[1], p[0], p[2])
- p[0] = ('binary-expression',p[2],p[1],p[3])
+ @_('LPAREN expr RPAREN')
+ def expr(self, p):
+ return ('group-expression',p[1])
-def p_expression_group(p):
- 'expression : LPAREN expression RPAREN'
- p[0] = ('group-expression',p[2])
+ @_('NUMBER')
+ def expr(self, p):
+ return ('number-expression', p[0])
-def p_expression_number(p):
- 'expression : NUMBER'
- p[0] = ('number-expression',p[1])
-
-
-
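+Once your rules return tuples like these, later passes are ordinary
+Python code that walks the structure. A minimal sketch of an
+evaluator for the three tuple shapes above::
+
+    import operator
+
+    OPS = {'+': operator.add, '-': operator.sub,
+           '*': operator.mul, '/': operator.truediv}
+
+    def evaluate(node):
+        kind = node[0]                 # Tag stored in the first slot
+        if kind == 'number-expression':
+            return node[1]
+        if kind == 'group-expression':
+            return evaluate(node[1])
+        if kind == 'binary-expression':
+            _, op, left, right = node
+            return OPS[op](evaluate(left), evaluate(right))
+        raise ValueError(f'unknown node type: {kind}')
+
+This assumes the lexer delivers operator tokens whose values are the
+matched text (``'+'``, ``'-'``, and so on) and ``NUMBER`` tokens
+already converted to numbers.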
-Another approach is to create a set of data structure for different
-kinds of abstract syntax tree nodes and assign nodes to p[0]
-in each rule. For example:
+Another approach is to create a set of data structures for different
+kinds of abstract syntax tree nodes and create different node types
+in each rule::
-
-class Expr: pass
+ class Expr:
+ pass
-class BinOp(Expr):
- def __init__(self,left,op,right):
- self.type = "binop"
- self.left = left
- self.right = right
- self.op = op
+ class BinOp(Expr):
+ def __init__(self, op, left, right):
+ self.op = op
+ self.left = left
+ self.right = right
-class Number(Expr):
- def __init__(self,value):
- self.type = "number"
- self.value = value
+ class Number(Expr):
+ def __init__(self, value):
+ self.value = value
-def p_expression_binop(p):
- '''expression : expression PLUS expression
- | expression MINUS expression
- | expression TIMES expression
- | expression DIVIDE expression'''
+ @_('expr PLUS expr',
+ 'expr MINUS expr',
+ 'expr TIMES expr',
+ 'expr DIVIDE expr')
+ def expr(self, p):
+ return BinOp(p[1], p[0], p[2])
- p[0] = BinOp(p[1],p[2],p[3])
+ @_('LPAREN expr RPAREN')
+ def expr(self, p):
+ return p[1]
-def p_expression_group(p):
- 'expression : LPAREN expression RPAREN'
- p[0] = p[2]
+ @_('NUMBER')
+ def expr(self, p):
+ return Number(p[0])
-def p_expression_number(p):
- 'expression : NUMBER'
- p[0] = Number(p[1])
-
-
+The advantage to this approach is that it may make it easier to attach
+more complicated semantics, type checking, code generation, and other
+features to the node classes.
-The advantage to this approach is that it may make it easier to attach more complicated
-semantics, type checking, code generation, and other features to the node classes.
+Embedded Actions
+^^^^^^^^^^^^^^^^
--To simplify tree traversal, it may make sense to pick a very generic -tree structure for your parse tree nodes. For example: +The parsing technique used by SLY only allows actions to be executed +at the end of a rule. For example, suppose you have a rule like this:: -
-
-class Node:
- def __init__(self,type,children=None,leaf=None):
- self.type = type
- if children:
- self.children = children
- else:
- self.children = [ ]
- self.leaf = leaf
-
-def p_expression_binop(p):
- '''expression : expression PLUS expression
- | expression MINUS expression
- | expression TIMES expression
- | expression DIVIDE expression'''
+ @_('A B C D')
+ def foo(self, p):
+ print("Parsed a foo", p[0],p[1],p[2],p[3])
- p[0] = Node("binop", [p[1],p[3]], p[2])
-
-
-
-
-
-def p_foo(p):
- "foo : A B C D"
- print("Parsed a foo", p[1],p[2],p[3],p[4])
-
-
-
-In this case, the supplied action code only executes after all of the -symbols A, B, C, and D have been +symbols ``A``, ``B``, ``C``, and ``D`` have been parsed. Sometimes, however, it is useful to execute small code fragments during intermediate stages of parsing. For example, suppose -you wanted to perform some action immediately after A has -been parsed. To do this, write an empty rule like this: +you wanted to perform some action immediately after ``A`` has +been parsed. To do this, write an empty rule like this:: -
-
-def p_foo(p):
- "foo : A seen_A B C D"
- print("Parsed a foo", p[1],p[3],p[4],p[5])
- print("seen_A returned", p[2])
+ @_('A seen_A B C D')
+ def foo(self, p):
+ print("Parsed a foo", p[0],p[2],p[3],p[4])
+ print("seen_A returned", p[1])
-def p_seen_A(p):
- "seen_A :"
- print("Saw an A = ", p[-1]) # Access grammar symbol to left
- p[0] = some_value # Assign value to seen_A
+ @_('')
+ def seen_A(self, p):
+ print("Saw an A = ", p[-1]) # Access grammar symbol to the left
+ return 'some_value' # Assign value to seen_A
-
-
+In this example, the empty ``seen_A`` rule executes immediately after
+``A`` is shifted onto the parsing stack. Within this rule, ``p[-1]``
+refers to the symbol on the stack that appears immediately to the left
+of the ``seen_A`` symbol. In this case, it would be the value of
+``A`` in the ``foo`` rule immediately above. Like other rules, a
+value can be returned from an embedded action by returning it.
--In this example, the empty seen_A rule executes immediately -after A is shifted onto the parsing stack. Within this -rule, p[-1] refers to the symbol on the stack that appears -immediately to the left of the seen_A symbol. In this case, -it would be the value of A in the foo rule -immediately above. Like other rules, a value can be returned from an -embedded action by simply assigning it to p[0] +The use of embedded actions can sometimes introduce extra shift/reduce +conflicts. For example, this grammar has no conflicts:: -
-The use of embedded actions can sometimes introduce extra shift/reduce conflicts. For example, -this grammar has no conflicts: + @_('abcd', + 'abcx') + def foo(self, p): + pass -
-
-def p_foo(p):
- """foo : abcd
- | abcx"""
+ @_('A B C D')
+ def abcd(self, p):
+ pass
-def p_abcd(p):
- "abcd : A B C D"
+ @_('A B C X')
+ def abcx(self, p):
+ pass
-def p_abcx(p):
- "abcx : A B C X"
-
-
+However, if you insert an embedded action into one of the rules like this::
-However, if you insert an embedded action into one of the rules like this,
+ @_('abcd',
+ 'abcx')
+ def foo(self, p):
+ pass
-
-
-def p_foo(p):
- """foo : abcd
- | abcx"""
+ @_('A B C D')
+ def abcd(self, p):
+ pass
-def p_abcd(p):
- "abcd : A B C D"
+ @_('A B seen_AB C X')
+ def abcx(self, p):
+ pass
-def p_abcx(p):
- "abcx : A B seen_AB C X"
-
-def p_seen_AB(p):
- "seen_AB :"
-
-
+ @_('')
+ def seen_AB(self, p):
+ pass
an extra shift-reduce conflict will be introduced. This conflict is
-caused by the fact that the same symbol C appears next in
-both the abcd and abcx rules. The parser can either
-shift the symbol (abcd rule) or reduce the empty
-rule seen_AB (abcx rule).
+caused by the fact that the same symbol ``C`` appears next in
+both the ``abcd`` and ``abcx`` rules. The parser can either
+shift the symbol (``abcd`` rule) or reduce the empty
+rule ``seen_AB`` (``abcx`` rule).
-A common use of embedded rules is to control other aspects of parsing -such as scoping of local variables. For example, if you were parsing C code, you might -write code like this: +such as scoping of local variables. For example, if you were parsing +C code, you might write code like this:: -
-
-def p_statements_block(p):
- "statements: LBRACE new_scope statements RBRACE"""
- # Action code
- ...
- pop_scope() # Return to previous scope
+ @_('LBRACE new_scope statements RBRACE')
+ def statements(self, p):
+ # Action code
+ ...
+ pop_scope() # Return to previous scope
+
+ @_('')
+ def new_scope(self, p):
+ # Create a new scope for local variables
+ create_scope()
+ ...
-def p_new_scope(p):
- "new_scope :"
- # Create a new scope for local variables
- s = new_scope()
- push_scope(s)
- ...
-
-
-
-In this case, the embedded action new_scope executes
-immediately after a LBRACE ({) symbol is parsed.
+In this case, the embedded action ``new_scope`` executes
+immediately after a ``LBRACE`` (``{``) symbol is parsed.
This might adjust internal symbol tables and other aspects of the
-parser. Upon completion of the rule statements_block, code
-might undo the operations performed in the embedded action
-(e.g., pop_scope()).
+parser. Upon completion of the rule ``statements``, code
+might undo the operations performed in the embedded action
+(e.g., ``pop_scope()``).
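+The ``create_scope()`` and ``pop_scope()`` helpers are not part of
+SLY; they stand in for whatever symbol-table machinery your compiler
+keeps. A minimal sketch of one way to implement them::
+
+    scopes = [{}]        # Stack of symbol tables; scopes[-1] is current
+
+    def create_scope():
+        scopes.append({})              # Enter a new local scope
+
+    def pop_scope():
+        scopes.pop()                   # Return to the enclosing scope
+
+    def define(name, value):
+        scopes[-1][name] = value       # Declare in the current scope
+
+    def lookup(name):
+        # Search from innermost to outermost scope
+        for table in reversed(scopes):
+            if name in table:
+                return table[name]
+        raise NameError(name)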
-
---in this case, x must be a Lexer object that minimally has a x.token() method for retrieving the next -token. If an input string is given to yacc.parse(), the lexer must also have an x.input() method. - --parser = yacc.parse(lexer=x) --
-
-- --parser = yacc.yacc(debug=False) --
-
-- --parser = yacc.yacc(tabmodule="foo") --
-Normally, the parsetab.py file is placed into the same directory as -the module where the parser is defined. If you want it to go somewhere else, you can -given an absolute package name for tabmodule instead. In that case, the -tables will be written there. -
- --
-- --parser = yacc.yacc(tabmodule="foo",outputdir="somedirectory") --
-Note: Be aware that unless the directory specified is also on Python's path (sys.path), subsequent -imports of the table file will fail. As a general rule, it's better to specify a destination using the -tabmodule argument instead of directly specifying a directory using the outputdir argument. -
- --
-- -Note: If you disable table generation, yacc() will regenerate the parsing tables -each time it runs (which may take awhile depending on how large your grammar is). - --parser = yacc.yacc(write_tables=False) --
-
-- --parser = yacc.parse(debug=True) --
-
-It should be noted that table generation is reasonably efficient, even for grammars that involve around a 100 rules -and several hundred states.
-
-
-
-- --from functools import wraps -from nodes import Collection - - -def strict(*types): - def decorate(func): - @wraps(func) - def wrapper(p): - func(p) - if not isinstance(p[0], types): - raise TypeError - - wrapper.co_firstlineno = func.__code__.co_firstlineno - return wrapper - - return decorate - -@strict(Collection) -def p_collection(p): - """ - collection : sequence - | map - """ - p[0] = p[1] --
-As a general rules this isn't a problem. However, to make it work, -you need to carefully make sure everything gets hooked up correctly. -First, make sure you save the objects returned by lex() and -yacc(). For example: - -
-- -Next, when parsing, make sure you give the parse() function a reference to the lexer it -should be using. For example: - --lexer = lex.lex() # Return lexer object -parser = yacc.yacc() # Return parser object --
-- -If you forget to do this, the parser will use the last lexer -created--which is not always what you want. - --parser.parse(text,lexer=lexer) --
-Within lexer and parser rule functions, these objects are also -available. In the lexer, the "lexer" attribute of a token refers to -the lexer object that triggered the rule. For example: - -
-- -In the parser, the "lexer" and "parser" attributes refer to the lexer -and parser objects respectively. - --def t_NUMBER(t): - r'\d+' - ... - print(t.lexer) # Show lexer object --
-- -If necessary, arbitrary attributes can be attached to the lexer or parser object. -For example, if you wanted to have different parsing modes, you could attach a mode -attribute to the parser object and look at it later. - --def p_expr_plus(p): - 'expr : expr PLUS expr' - ... - print(p.parser) # Show parser object - print(p.lexer) # Show lexer object --
-- -then PLY can later be used when Python runs in optimized mode. To make this work, -make sure you first run Python in normal mode. Once the lexing and parsing tables -have been generated the first time, run Python in optimized mode. PLY will use -the tables without the need for doc strings. - --lex.lex(optimize=1) -yacc.yacc(optimize=1) --
-Beware: running PLY in optimized mode disables a lot of error -checking. You should only do this when your project has stabilized -and you don't need to do any debugging. One of the purposes of -optimized mode is to substantially decrease the startup time of -your compiler (by assuming that everything is already properly -specified and works). - -
-Debugging a compiler is typically not an easy task. PLY provides some -advanced diagostic capabilities through the use of Python's -logging module. The next two sections describe this: - -
-Both the lex() and yacc() commands have a debugging -mode that can be enabled using the debug flag. For example: - -
-- -Normally, the output produced by debugging is routed to either -standard error or, in the case of yacc(), to a file -parser.out. This output can be more carefully controlled -by supplying a logging object. Here is an example that adds -information about where different debugging messages are coming from: - --lex.lex(debug=True) -yacc.yacc(debug=True) --
-- -If you supply a custom logger, the amount of debugging -information produced can be controlled by setting the logging level. -Typically, debugging messages are either issued at the DEBUG, -INFO, or WARNING levels. - --# Set up a logging object -import logging -logging.basicConfig( - level = logging.DEBUG, - filename = "parselog.txt", - filemode = "w", - format = "%(filename)10s:%(lineno)4d:%(message)s" -) -log = logging.getLogger() - -lex.lex(debug=True,debuglog=log) -yacc.yacc(debug=True,debuglog=log) --
-PLY's error messages and warnings are also produced using the logging -interface. This can be controlled by passing a logging object -using the errorlog parameter. - -
-- -If you want to completely silence warnings, you can either pass in a -logging object with an appropriate filter level or use the NullLogger -object defined in either lex or yacc. For example: - --lex.lex(errorlog=log) -yacc.yacc(errorlog=log) --
-- --yacc.yacc(errorlog=yacc.NullLogger()) --
-To enable run-time debugging of a parser, use the debug option to parse. This -option can either be an integer (which simply turns debugging on or off) or an instance -of a logger object. For example: - -
-- -If a logging object is passed, you can use its filtering level to control how much -output gets generated. The INFO level is used to produce information -about rule reductions. The DEBUG level will show information about the -parsing stack, token shifts, and other details. The ERROR level shows information -related to parsing errors. - --log = logging.getLogger() -parser.parse(input,debug=log) --
-For very complicated problems, you should pass in a logging object that -redirects to a file where you can more easily inspect the output after -execution. - -