Doc updates

David Beazley 2016-09-14 12:33:50 -05:00
parent 54ff0c3851
commit 5c3083712f


@@ -45,12 +45,12 @@ The next two parts describe the basics.

Writing a Lexer
---------------

Suppose you're writing a programming language and you want to parse the
following input string::

    x = 3 + 42 * (s - t)

The first step of parsing is to break the text into tokens where
each token has a type and a value. For example, the above text might be
described by the following list of token tuples::
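
    # (a sketch of the tokenization elided by this hunk)
    [ ('ID','x'), ('ASSIGN','='), ('NUMBER','3'),
      ('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
      ('LPAREN','('), ('ID','s'), ('MINUS','-'),
      ('ID','t'), ('RPAREN',')') ]
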
@@ -195,36 +195,14 @@ comments and newlines::

    from sly import Lexer

    class CalcLexer(Lexer):
        ...
        # String containing ignored characters (between tokens)
        ignore = ' \t'

        # Other ignored patterns
        ignore_comment = r'\#.*'
        ignore_newline = r'\n+'
        ...

    if __name__ == '__main__':
        data = '''x = 3 + 42
@@ -250,12 +228,25 @@ expression using the ``@_()`` decorator like this::

        return t

The method always takes a single argument which is an instance of
``Token``. By default, ``t.type`` is set to the name of the token
(e.g., ``'NUMBER'``). The function can change the token type and
value as it sees appropriate. When finished, the resulting token
object should be returned as a result. If no value is returned by the
function, the token is discarded and the next token is read.

The ``@_()`` decorator is defined automatically within the ``Lexer``
class--you don't need to do any kind of special import for it.
It can also accept multiple regular expression rules. For example::

    @_(r'0x[0-9a-fA-F]+',
       r'\d+')
    def NUMBER(self, t):
        if t.value.startswith('0x'):
            t.value = int(t.value[2:], 16)
        else:
            t.value = int(t.value)
        return t
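
With this rule, both hexadecimal and decimal literals produce ``NUMBER``
tokens whose values are integers. For example, the inputs ``0x1f`` and
``31`` both yield a token with the integer value ``31``.
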
Instead of using the ``@_()`` decorator, you can also write a method
that matches the same name as a token previously specified as a
string. For example::
@@ -269,8 +260,9 @@ string. For example::

        return t

This is a potentially useful trick for debugging a lexer. You can
temporarily attach a method to a token and have it execute when the
token is encountered. If you later take the method away, the lexer
reverts to its original behavior.
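
For instance, assuming the lexer already defines an ``ID`` token as a
string rule, a temporary debugging hook might look like this (a sketch,
not part of the original example)::

    def ID(self, t):
        # Print each identifier as it is lexed, then pass it through
        print('Lexed identifier:', t.value)
        return t
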
Line numbers and position tracking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -279,21 +271,21 @@ By default, lexers know nothing about line numbers. This is because

they don't know anything about what constitutes a "line" of input
(e.g., the newline character or even if the input is textual data).
To update this information, you need to add a special rule for newlines.
Promote the ``ignore_newline`` rule to a method like this::

    # Define a rule so we can track line numbers
    @_(r'\n+')
    def ignore_newline(self, t):
        self.lineno += len(t.value)

Within the rule, the ``lineno`` attribute of the lexer is now updated.
After the line number is updated, the token is discarded since nothing
is returned.

Lexers do not perform any kind of automatic column tracking. However,
they do record positional information related to each token in the token's
``index`` attribute. Using this, it is usually possible to compute
column information as a separate step. For instance, you can search
backwards until you reach the previous newline::

    # Compute column.
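    #   (a sketch of the rest of this helper, which is elided here;
    #    assumes `text` is the input string given to tokenize())
    def find_column(text, token):
        last_cr = text.rfind('\n', 0, token.index)
        if last_cr < 0:
            last_cr = 0
        column = (token.index - last_cr) + 1
        return column
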
@@ -321,7 +313,7 @@ Literal characters can be specified by defining a set

    literals = { '+','-','*','/' }
    ...

A literal character is a *single character* that is returned "as
is" when encountered by the lexer. Literals are checked after all of
the defined regular expression rules. Thus, if a rule starts with one
of the literal characters, it will always take precedence.
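
For example, if you also defined a (hypothetical) token rule such as
``PLUSPLUS = r'\+\+'``, the input ``++`` would match that rule rather
than two ``'+'`` literals.
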
@@ -369,7 +361,7 @@ For example::

        print("Illegal character '%s'" % value[0])
        self.index += 1

In this case, we print the offending character and skip ahead
one character by updating the lexer position. Error handling in a
parser is often a hard problem. An error handler might scan ahead
to a logical synchronization point such as a semicolon, a blank line,
@@ -386,7 +378,8 @@ into practice::

    from sly import Lexer

    class CalcLexer(Lexer):
        # Set of reserved names (language keywords)
        reserved_words = { 'WHILE', 'IF', 'ELSE', 'PRINT' }

        # Set of token names. This is always required
        tokens = {

@@ -431,6 +424,7 @@ into practice::

        @_(r'[a-zA-Z_][a-zA-Z0-9_]*')
        def ID(self, t):
            # Check if name matches a reserved word (change token type if true)
            if t.value.upper() in self.reserved_words:
                t.type = t.value.upper()
            return t
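
With the ``ID`` rule written this way, input such as ``while x`` comes
back as a ``WHILE`` token followed by an ``ID`` token, whereas a
non-reserved name such as ``count`` remains an ``ID``.
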
@@ -487,15 +481,15 @@ Study this example closely. It might take a bit to digest, but all of the

essential parts of writing a lexer are there. Tokens have to be specified
with regular expression rules. You can optionally attach actions that
execute when certain patterns are encountered. Certain features such as
character literals are there mainly for convenience, saving you the trouble
of writing separate regular expression rules. You can also add error handling.

Writing a Parser
----------------

The ``Parser`` class is used to parse language syntax. Before showing
an example, there are a few important bits of background that must be
covered.

Parsing Background
^^^^^^^^^^^^^^^^^^

@@ -519,15 +513,19 @@ In the grammar, symbols such as ``NUMBER``, ``+``, ``-``, ``*``, and

``/`` are known as *terminals* and correspond to raw input tokens.
Identifiers such as ``term`` and ``factor`` refer to grammar rules
composed of a collection of terminals and other rules. These
identifiers are known as *non-terminals*. The separation of the
grammar into different levels (e.g., ``expr`` and ``term``) encodes
the operator precedence rules for the different operations. In this
case, multiplication and division have higher precedence than addition
and subtraction.
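
To make the layering concrete, here is a sketch of such a grammar
(consistent with the rules referenced later in this section)::

    expr   : expr + term
           | expr - term
           | term

    term   : term * factor
           | term / factor
           | factor

    factor : NUMBER
           | ( expr )
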
The semantics of what happens during parsing are often specified using
a technique known as syntax directed translation. In syntax directed
translation, the symbols in the grammar become a kind of
object. Values can be attached to each symbol and operations carried
out on those values when different grammar rules are recognized. For
example, given the expression grammar above, you might write the
specification for the operation of a simple calculator like this::

    Grammar                      Action
    ------------------------     --------------------------------

@@ -542,12 +540,23 @@ simple calculator like this::

    factor : NUMBER              factor.val = int(NUMBER.val)
           | ( expr )            factor.val = expr.val

In this grammar, new values enter via the ``NUMBER`` token. Those
values then propagate according to the actions described above. For
example, ``factor.val = int(NUMBER.val)`` propagates the value from
``NUMBER`` to ``factor``. ``term0.val = factor.val`` propagates the
value from ``factor`` to ``term``. Rules such as ``expr0.val =
expr1.val + term1.val`` combine and propagate values further. Just to
illustrate, here is how values propagate in the expression ``2 + 3 * 4``::

    NUMBER.val=2 + NUMBER.val=3 * NUMBER.val=4    # NUMBER -> factor
    factor.val=2 + NUMBER.val=3 * NUMBER.val=4    # factor -> term
    term.val=2 + NUMBER.val=3 * NUMBER.val=4      # term -> expr
    expr.val=2 + NUMBER.val=3 * NUMBER.val=4      # NUMBER -> factor
    expr.val=2 + factor.val=3 * NUMBER.val=4      # factor -> term
    expr.val=2 + term.val=3 * NUMBER.val=4        # NUMBER -> factor
    expr.val=2 + term.val=3 * factor.val=4        # term * factor -> term
    expr.val=2 + term.val=12                      # expr + term -> expr
    expr.val=14
SLY uses a parsing technique known as LR-parsing or shift-reduce
parsing. LR parsing is a bottom up technique that tries to recognize

@@ -659,10 +668,6 @@ SLY::

        def factor(self, p):
            return p[1]

    if __name__ == '__main__':
        lexer = CalcLexer()
        parser = CalcParser()
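        # (a sketch of the remainder of this driver, elided here)
        while True:
            try:
                text = input('calc > ')
                result = parser.parse(lexer.tokenize(text))
                print(result)
            except EOFError:
                break
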
@@ -677,10 +682,10 @@ SLY::

In this example, each grammar rule is defined by a method that's been
decorated by the ``@_(rule)`` decorator. The very first grammar rule
defines the top of the parse (the first rule listed in a BNF grammar).
The name of each method must match the name of the grammar rule being
parsed. The argument to the ``@_()`` decorator is a string describing
the right-hand-side of the grammar. Thus, a grammar rule like this::

    expr : expr PLUS term
@@ -692,7 +697,7 @@ becomes a method like this::
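
    # (a sketch of the method elided by this hunk)
    @_('expr PLUS term')
    def expr(self, p):
        ...
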
The method is triggered when that grammar rule is recognized on the
input. As an argument, the method receives a sequence of grammar symbol
values ``p`` that is accessed as an array of symbols. The mapping between
elements of ``p`` and the grammar rule is as shown here::

    #          p[0]   p[1]  p[2]
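    expr  :    expr   PLUS  term      # (row reconstructed from the rule above)
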
@@ -735,10 +740,8 @@ or perhaps create an instance related to an abstract syntax tree::

        return BinOp('+', p[0], p[2])

The key thing is that the method returns the value that's going to
be attached to the symbol "expr" in this case. This is the propagation
of values that was described in the previous section.
Combining Grammar Rule Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -789,12 +792,12 @@ declaration::

    class CalcLexer(Lexer):
        ...
        literals = { '+','-','*','/' }
        ...

Character literals are limited to a single character. Thus, it is not
legal to specify literals such as ``<=`` or ``==``. For this, use the
normal lexing rules (e.g., define a rule such as ``LE = r'<='``).
Empty Productions
^^^^^^^^^^^^^^^^^

@@ -806,7 +809,19 @@ If you need an empty production, define a special rule like this::

        pass

Now to use the empty production elsewhere, use the name 'empty' as a
symbol. For example, suppose you need to encode a rule that involves
an optional item like this::

    spam : optitem grok

    optitem : item
            | empty

You would encode the rules in SLY as follows::

    @_('optitem grok')
    def spam(self, p):
        ...

    @_('item')
    def optitem(self, p):
        ...

@@ -816,32 +831,11 @@ example::

    @_('empty')
    def optitem(self, p):
        ...

Note: You could write empty rules anywhere by specifying an empty
string. However, writing an "empty" rule and using "empty" to denote
an empty production may be easier to read and may state your intention
more clearly.
Dealing With Ambiguous Grammars
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -871,7 +865,7 @@ example, consider the string "3 * 4 + 5" and the internal parsing

stack::

    Step Symbol Stack  Input Tokens     Action
    ---- ------------- ---------------- -------------------------------
    1    $             3 * 4 + 5$       Shift 3
    2    $ 3           * 4 + 5$         Reduce : expr : NUMBER
    3    $ expr        * 4 + 5$         Shift *
@@ -944,11 +938,12 @@ associativity specifiers.

2. If the grammar rule on the stack has higher precedence, the rule is reduced.

3. If the current token and the grammar rule have the same precedence,
   the rule is reduced for left associativity, whereas the token is
   shifted for right associativity.

4. If nothing is known about the precedence, shift/reduce conflicts
   are resolved in favor of shifting (the default).

For example, if ``expr PLUS expr`` has been parsed and the
next token is ``TIMES``, the action is going to be a shift because
@@ -994,10 +989,11 @@ When you use the ``%prec`` qualifier, you're telling SLY

that you want the precedence of the expression to be the same as for
this special marker instead of the usual precedence.

It is also possible to specify non-associativity in the ``precedence``
table. This is used when you *don't* want operations to chain
together. For example, suppose you wanted to support comparison
operators like ``<`` and ``>`` but you didn't want combinations like
``a < b < c``. To do this, specify the precedence rules like this::

    class MyParser(Parser):
        ...
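        # (a sketch of the elided table; these token names are assumptions)
        precedence = (
            ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
            ('left', 'PLUS', 'MINUS'),
            ('left', 'TIMES', 'DIVIDE'),
        )
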
@@ -1140,8 +1136,9 @@ When a syntax error occurs, SLY performs the following steps:

5. If a grammar rule accepts ``error`` as a token, it will be
   shifted onto the parsing stack.

6. If the top item of the parsing stack is ``error``, lookahead tokens
   will be discarded until the parser can successfully shift a new
   symbol or reduce a rule involving ``error``.
Recovery and resynchronization with error rules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -1181,7 +1178,9 @@ appear as the last token on the right in an error rule. For example::

This is because the first bad token encountered will cause the rule to
be reduced--which may make it difficult to recover if more bad tokens
immediately follow. It's better to have some kind of landmark such as
a semicolon, a closing parenthesis, or another token that can be used
as a synchronization point.
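
For example, a rule that resynchronizes on a semicolon might look like
this (a sketch; the ``PRINT`` and ``SEMI`` token names are assumptions)::

    @_('PRINT error SEMI')
    def statement(self, p):
        print('Syntax error in print statement. Bad expression')
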
Panic mode recovery
~~~~~~~~~~~~~~~~~~~

@@ -1208,7 +1207,7 @@ state::

                break
        self.restart()

This function discards the bad token and tells the parser that
the error was ok::

    def error(self, p):
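        # (a sketch of the body elided here)
        if p:
            print('Syntax error at token', p.type)
            # Discard the bad token and tell the parser it's okay
            self.errok()
        else:
            print('Syntax error at EOF')
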
@@ -1278,7 +1277,7 @@ Line Number and Position Tracking

Position tracking is often a tricky problem when writing compilers.
By default, SLY tracks the line number and position of all tokens.
The following attributes may be useful in a production rule:

- ``p.lineno``. Line number of the left-most terminal in a production.
- ``p.index``. Lexing index of the left-most terminal in a production.
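
For example, a production could record where its left-most terminal
appeared (a sketch, not part of the original text)::

    @_('expr PLUS expr')
    def expr(self, p):
        line = p.lineno      # line number of the left-most expr
        index = p.index      # lexing index of the left-most expr
        return p[0] + p[2]
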
@@ -1301,9 +1300,9 @@ AST Construction

SLY provides no special functions for constructing an abstract syntax
tree. However, such construction is easy enough to do on your own.

A minimal way to construct a tree is to create and
propagate a tuple or list in each grammar rule function. There
are many possible ways to do this, but one example is something
like this::

    @_('expr PLUS expr',
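       # (a sketch of how the elided example might continue)
       'expr MINUS expr',
       'expr TIMES expr',
       'expr DIVIDE expr')
    def expr(self, p):
        # Build a tuple node of the form (op, left, right)
        return ('binop', p[1], p[0], p[2])
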
@@ -1357,6 +1356,27 @@ The advantage to this approach is that it may make it easier to attach

more complicated semantics, type checking, code generation, and other
features to the node classes.
Changing the starting symbol
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Normally, the first rule found in a parser class defines the starting
grammar rule (top level rule). To change this, supply a ``start``
specifier in your class. For example::

    class CalcParser(Parser):
        start = 'foo'

        @_('A B')
        def bar(self, p):
            ...

        @_('bar X')
        def foo(self, p):     # Parsing starts here (start symbol above)
            ...

The use of a ``start`` specifier may be useful during debugging
since you can use it to work with a subset of a larger grammar.
Embedded Actions
^^^^^^^^^^^^^^^^

@@ -1454,9 +1474,3 @@ This might adjust internal symbol tables and other aspects of the

parser. Upon completion of the rule ``statements``, code
undoes the operations performed in the embedded action
(e.g., ``pop_scope()``).