Doc updates

parent 54ff0c3851
commit 5c3083712f

docs/sly.rst (246 changed lines)
@@ -45,12 +45,12 @@ The next two parts describe the basics.
 Writing a Lexer
 ---------------
 
-Suppose you're writing a programming language and a user supplied the
+Suppose you're writing a programming language and you wanted to parse the
 following input string::
 
     x = 3 + 42 * (s - t)
 
-The first step of any parsing is to break the text into tokens where
+The first step of parsing is to break the text into tokens where
 each token has a type and value. For example, the above text might be
 described by the following list of token tuples::
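The hunk stops just before the list itself; for context, the token stream for this input looks roughly like this in the surrounding docs (a sketch, using the token names from the lexer example below)::

    [ ('ID','x'), ('ASSIGN','='), ('NUMBER','3'),
      ('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
      ('LPAREN','('), ('ID','s'), ('MINUS','-'),
      ('ID','t'), ('RPAREN',')') ]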
@@ -195,36 +195,14 @@ comments and newlines::
 
     from sly import Lexer
 
     class CalcLexer(Lexer):
-        # Set of token names.  This is always required
-        tokens = {
-            'NUMBER',
-            'ID',
-            'PLUS',
-            'MINUS',
-            'TIMES',
-            'DIVIDE',
-            'ASSIGN',
-            'LPAREN',
-            'RPAREN',
-        }
+        ...
         # String containing ignored characters (between tokens)
         ignore = ' \t'
 
-        # Regular expression rules for tokens
-        PLUS = r'\+'
-        MINUS = r'-'
-        TIMES = r'\*'
-        DIVIDE = r'/'
-        LPAREN = r'\('
-        RPAREN = r'\)'
-        ASSIGN = r'='
-        NUMBER = r'\d+'
-        ID = r'[a-zA-Z_][a-zA-Z0-9_]*'
-
-        # Ignored text
+        # Other ignored patterns
         ignore_comment = r'\#.*'
         ignore_newline = r'\n+'
+        ...
 
     if __name__ == '__main__':
         data = '''x = 3 + 42
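The hunk cuts off inside the triple-quoted test string. For reference, driving the lexer follows the usual SLY pattern (a sketch; the exact test input is elided by the hunk)::

    if __name__ == '__main__':
        data = 'x = 3 + 42 * (s - t)   # a comment'
        lexer = CalcLexer()
        for tok in lexer.tokenize(data):
            print('type=%r, value=%r' % (tok.type, tok.value))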
@@ -250,12 +228,25 @@ expression using the ``@_()`` decorator like this::
         return t
 
 The method always takes a single argument which is an instance of
-``Token``. By default, ``t.type`` is set to the name of the token
+type ``Token``. By default, ``t.type`` is set to the name of the token
 (e.g., ``'NUMBER'``). The function can change the token type and
 value as it sees appropriate. When finished, the resulting token
 object should be returned as a result. If no value is returned by the
 function, the token is discarded and the next token read.
 
 The ``@_()`` decorator is defined automatically within the ``Lexer``
 class--you don't need to do any kind of special import for it.
+It can also accept multiple regular expression rules. For example::
+
+    @_(r'0x[0-9a-fA-F]+',
+       r'\d+')
+    def NUMBER(self, t):
+        if t.value.startswith('0x'):
+            t.value = int(t.value[2:], 16)
+        else:
+            t.value = int(t.value)
+        return t
+
 Instead of using the ``@_()`` decorator, you can also write a method
 that matches the same name as a token previously specified as a
 string. For example::
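The example elided between this hunk and the next is, per the surrounding docs, along these lines (a sketch)::

    class CalcLexer(Lexer):
        ...
        # Token defined as a plain string pattern
        NUMBER = r'\d+'
        ...
        # Method with the same name: runs whenever a NUMBER is matched
        def NUMBER(self, t):
            t.value = int(t.value)
            return t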
@@ -269,8 +260,9 @@ string. For example::
         return t
 
 This is a potentially useful trick for debugging a lexer. You can temporarily
-attach a method to fire when a token is seen and take it away later without
-changing any existing part of the lexer class.
+attach a method to a token and have it execute when the token is encountered.
+If you later take the method away, the lexer will revert back to its original
+behavior.
 
 Line numbers and position tracking
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -279,21 +271,21 @@ By default, lexers know nothing about line numbers. This is because
 they don't know anything about what constitutes a "line" of input
 (e.g., the newline character or even if the input is textual data).
 To update this information, you need to add a special rule for newlines.
-Promote the ``ignore_newline`` token to a method like this::
+Promote the ``ignore_newline`` rule to a method like this::
 
     # Define a rule so we can track line numbers
     @_(r'\n+')
     def ignore_newline(self, t):
         self.lineno += len(t.value)
 
-Within the rule, the lineno attribute of the lexer is updated. After
-the line number is updated, the token is simply discarded since
-nothing is returned.
+Within the rule, the lineno attribute of the lexer is now updated.
+After the line number is updated, the token is discarded since nothing
+is returned.
 
 Lexers do not perform any kind of automatic column tracking. However,
-it does record positional information related to each token in the
+it does record positional information related to each token in the token's
 ``index`` attribute. Using this, it is usually possible to compute
-column information as a separate step. For instance, you could count
+column information as a separate step. For instance, you can search
 backwards until you reach the previous newline::
 
     # Compute column.
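The hunk stops at the comment; the column-computing helper it introduces is, in the docs, along these lines (a sketch)::

    # Compute column.
    #     text is the overall input text
    #     token is a token instance
    def find_column(text, token):
        last_cr = text.rfind('\n', 0, token.index)
        if last_cr < 0:
            last_cr = 0
        column = (token.index - last_cr) + 1
        return column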
@@ -321,7 +313,7 @@ Literal characters can be specified by defining a set
     literals = { '+','-','*','/' }
     ...
 
-A literal character is simply a single character that is returned "as
+A literal character is a *single character* that is returned "as
 is" when encountered by the lexer. Literals are checked after all of
 the defined regular expression rules. Thus, if a rule starts with one
 of the literal characters, it will always take precedence.
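To illustrate that precedence rule (a sketch; the ``INCR`` token is an assumption, not part of the diff)::

    class MyLexer(Lexer):
        tokens = { 'INCR' }
        literals = { '+' }

        # Checked before literals: '++' lexes as one INCR token,
        # while a lone '+' falls through to the '+' literal.
        INCR = r'\+\+'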
@@ -369,7 +361,7 @@ For example::
             print("Illegal character '%s'" % value[0])
             self.index += 1
 
-In this case, we simply print the offending character and skip ahead
+In this case, we print the offending character and skip ahead
 one character by updating the lexer position. Error handling in a
 parser is often a hard problem. An error handler might scan ahead
 to a logical synchronization point such as a semicolon, a blank line,
@@ -386,7 +378,8 @@ into practice::
 
     from sly import Lexer
 
     class CalcLexer(Lexer):
-        reserved_words = { 'WHILE', 'PRINT' }
+        # Set of reserved names (language keywords)
+        reserved_words = { 'WHILE', 'IF', 'ELSE', 'PRINT' }
 
         # Set of token names. This is always required
         tokens = {
@@ -431,6 +424,7 @@ into practice::
 
         @_(r'[a-zA-Z_][a-zA-Z0-9_]*')
         def ID(self, t):
+            # Check if name matches a reserved word (change token type if true)
             if t.value.upper() in self.reserved_words:
                 t.type = t.value.upper()
             return t
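As a quick illustration of the effect (a sketch; the input is assumed), reserved names come back with their own token types while other names stay ``ID``::

    lexer = CalcLexer()
    for tok in lexer.tokenize('while x'):
        print(tok.type, tok.value)

    # Expected output (sketch):
    #   WHILE while
    #   ID x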
@@ -487,15 +481,15 @@ Study this example closely. It might take a bit to digest, but all of the
 essential parts of writing a lexer are there. Tokens have to be specified
 with regular expression rules. You can optionally attach actions that
 execute when certain patterns are encountered. Certain features such as
-character literals might make it easier to lex certain kinds of text.
-You can also add error handling.
+character literals are there mainly for convenience, saving you the trouble
+of writing separate regular expression rules. You can also add error handling.
 
 Writing a Parser
 ----------------
 
 The ``Parser`` class is used to parse language syntax. Before showing
 an example, there are a few important bits of background that must be
-mentioned.
+covered.
 
 Parsing Background
 ^^^^^^^^^^^^^^^^^^
@@ -519,15 +513,19 @@ In the grammar, symbols such as ``NUMBER``, ``+``, ``-``, ``*``, and
 ``/`` are known as *terminals* and correspond to raw input tokens.
 Identifiers such as ``term`` and ``factor`` refer to grammar rules
 comprised of a collection of terminals and other rules. These
-identifiers are known as *non-terminals*.
+identifiers are known as *non-terminals*. The separation of the
+grammar into different levels (e.g., ``expr`` and ``term``) encodes
+the operator precedence rules for the different operations. In this
+case, multiplication and division have higher precedence than addition
+and subtraction.
 
-The semantic behavior of a language is often specified using a
-technique known as syntax directed translation. In syntax directed
-translation, values are attached to each symbol in a given grammar
-rule along with an action. Whenever a particular grammar rule is
-recognized, the action describes what to do. For example, given the
-expression grammar above, you might write the specification for a
-simple calculator like this::
+The semantics of what happens during parsing is often specified using
+a technique known as syntax directed translation. In syntax directed
+translation, the symbols in the grammar become a kind of
+object. Values can be attached to each symbol and operations carried out
+on those values when different grammar rules are recognized. For
+example, given the expression grammar above, you might write the
+specification for the operation of a simple calculator like this::
 
     Grammar                      Action
     ------------------------     --------------------------------
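The table rows elided between this hunk and the next give the ``expr`` and ``term`` productions; per the surrounding docs they read roughly::

    expr0   : expr1 + term       expr0.val = expr1.val + term.val
            | expr1 - term       expr0.val = expr1.val - term.val
            | term               expr0.val = term.val

    term0   : term1 * factor     term0.val = term1.val * factor.val
            | term1 / factor     term0.val = term1.val / factor.val
            | factor             term0.val = factor.val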
@@ -542,12 +540,23 @@ simple calculator like this::
     factor  : NUMBER             factor.val = int(NUMBER.val)
             | ( expr )           factor.val = expr.val
 
-A good way to think about syntax directed translation is to view each
-symbol in the grammar as a kind of object. Associated with each symbol
-is a value representing its "state" (for example, the ``val``
-attribute above). Semantic actions are then expressed as a collection
-of functions or methods that operate on the symbols and associated
-values.
+In this grammar, new values enter via the ``NUMBER`` token. Those
+values then propagate according to the actions described above. For
+example, ``factor.val = int(NUMBER.val)`` propagates the value from
+``NUMBER`` to ``factor``. ``term0.val = factor.val`` propagates the
+value from ``factor`` to ``term``. Rules such as ``expr0.val =
+expr1.val + term1.val`` combine and propagate values further. Just to
+illustrate, here is how values propagate in the expression ``2 + 3 * 4``::
+
+    NUMBER.val=2 + NUMBER.val=3 * NUMBER.val=4    # NUMBER -> factor
+    factor.val=2 + NUMBER.val=3 * NUMBER.val=4    # factor -> term
+    term.val=2 + NUMBER.val=3 * NUMBER.val=4      # term -> expr
+    expr.val=2 + NUMBER.val=3 * NUMBER.val=4      # NUMBER -> factor
+    expr.val=2 + factor.val=3 * NUMBER.val=4      # factor -> term
+    expr.val=2 + term.val=3 * NUMBER.val=4        # NUMBER -> factor
+    expr.val=2 + term.val=3 * factor.val=4        # term * factor -> term
+    expr.val=2 + term.val=12                      # expr + term -> expr
+    expr.val=14
 
 SLY uses a parsing technique known as LR-parsing or shift-reduce
 parsing. LR parsing is a bottom up technique that tries to recognize
@@ -659,10 +668,6 @@ SLY::
         def factor(self, p):
             return p[1]
 
-        # Error rule for syntax errors
-        def error(self, p):
-            print("Syntax error in input!")
-
     if __name__ == '__main__':
         lexer = CalcLexer()
         parser = CalcParser()
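The ``if __name__`` block is cut off by the hunk; in the docs it continues with the usual read-parse loop (a sketch)::

        while True:
            try:
                text = input('calc > ')
                result = parser.parse(lexer.tokenize(text))
                print(result)
            except EOFError:
                break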
@@ -677,10 +682,10 @@ SLY::
 
 In this example, each grammar rule is defined by a method that's been
 decorated by the ``@_(rule)`` decorator. The very first grammar rule
-defines the top of the parse. The name of each method should match
-the name of the grammar rule being parsed. The argument to the
-``@_()`` decorator is a string describing the right-hand-side of the
-grammar. Thus, a grammar rule like this::
+defines the top of the parse (the first rule listed in a BNF grammar).
+The name of each method must match the name of the grammar rule being
+parsed. The argument to the ``@_()`` decorator is a string describing
+the right-hand-side of the grammar. Thus, a grammar rule like this::
 
     expr : expr PLUS term
@@ -692,7 +697,7 @@ becomes a method like this::
 
 The method is triggered when that grammar rule is recognized on the
 input. As an argument, the method receives a sequence of grammar symbol
-values ``p`` that is accessed as an array. The mapping between
+values ``p`` that is accessed as an array of symbols. The mapping between
 elements of ``p`` and the grammar rule is as shown here::
 
     # p[0]  p[1] p[2]
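The mapping diagram is cut off by the hunk; it continues roughly like this (a sketch: each ``p[i]`` names one right-hand-side symbol of the rule)::

    #        p[0] p[1] p[2]
    expr   : expr PLUS term

For the rule above, ``p[0]`` is the value of the nested ``expr``, ``p[1]`` the ``PLUS`` token, and ``p[2]`` the value of ``term``.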
@@ -735,10 +740,8 @@ or perhaps create an instance related to an abstract syntax tree::
         return BinOp('+', p[0], p[2])
 
 The key thing is that the method returns the value that's going to
-be attached to the symbol "expr" in this case.
-
-The ``error()`` method is defined to handle syntax errors (if any).
-See the error handling section below for more detail.
+be attached to the symbol "expr" in this case. This is the propagation
+of values that was described in the previous section.
 
 Combining Grammar Rule Functions
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
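``BinOp`` itself is not defined in the hunk above; a minimal stand-in consistent with the call (an assumption, not the docs' definition) might be::

    class BinOp:
        def __init__(self, op, left, right):
            self.op = op
            self.left = left
            self.right = right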
@@ -789,12 +792,12 @@ declaration::
 
     class CalcLexer(Lexer):
         ...
-        literals = ['+','-','*','/' ]
+        literals = { '+','-','*','/' }
         ...
 
 Character literals are limited to a single character. Thus, it is not
 legal to specify literals such as ``<=`` or ``==``. For this, use the
-normal lexing rules (e.g., define a rule such as ``EQ = r'=='``).
+normal lexing rules (e.g., define a rule such as ``LE = r'<='``).
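In grammar rules, such literals then appear quoted inside the rule string (a brief sketch of standard SLY usage)::

    @_('expr "+" term')
    def expr(self, p):
        return p[0] + p[2]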
 Empty Productions
 ^^^^^^^^^^^^^^^^^
@@ -806,7 +809,19 @@ If you need an empty production, define a special rule like this::
         pass
 
 Now to use the empty production elsewhere, use the name 'empty' as a symbol. For
-example::
+example, suppose you need to encode a rule that involved an optional item like this::
+
+    spam : optitem grok
+
+    optitem : item
+            | empty
+
+You would encode the rules in SLY as follows::
 
     @_('optitem grok')
     def spam(self, p):
         ...
 
     @_('item')
     def optitem(self, p):
@@ -816,32 +831,11 @@ example::
     def optitem(self, p):
         ...
 
-Note: You can write empty rules anywhere by simply specifying an empty
+Note: You could write empty rules anywhere by specifying an empty
 string. However, writing an "empty" rule and using "empty" to denote an
 empty production may be easier to read and more clearly state your
 intention.
 
-Changing the starting symbol
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Normally, the first rule found in a parser class defines the starting
-grammar rule (top level rule). To change this, supply a ``start``
-specifier in your class. For example::
-
-    class CalcParser(Parser):
-        start = 'foo'
-
-        @_('A B')
-        def bar(self, p):
-            ...
-
-        @_('bar X')
-        def foo(self, p):      # Parsing starts here (start symbol above)
-            ...
-
-The use of a ``start`` specifier may be useful during debugging
-since you can use it to work with a subset of a larger grammar.
-
 Dealing With Ambiguous Grammars
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -871,7 +865,7 @@ example, consider the string "3 * 4 + 5" and the internal parsing
 stack::
 
     Step Symbol Stack           Input Tokens            Action
-    ---- ---------------------  ---------------------   -------------------------------
+    ---- -------------          ----------------        -------------------------------
     1    $                      3 * 4 + 5$              Shift 3
     2    $ 3                    * 4 + 5$                Reduce expr : NUMBER
     3    $ expr                 * 4 + 5$                Shift *
@@ -944,11 +938,12 @@ associativity specifiers.
 
 2. If the grammar rule on the stack has higher precedence, the rule is reduced.
 
-3. If the current token and the grammar rule have the same precedence, the
-   rule is reduced for left associativity, whereas the token is shifted for right associativity.
+3. If the current token and the grammar rule have the same precedence,
+   the rule is reduced for left associativity, whereas the token is
+   shifted for right associativity.
 
-4. If nothing is known about the precedence, shift/reduce conflicts are resolved in
-   favor of shifting (the default).
+4. If nothing is known about the precedence, shift/reduce conflicts
+   are resolved in favor of shifting (the default).
 
 For example, if ``expr PLUS expr`` has been parsed and the
 next token is ``TIMES``, the action is going to be a shift because
@@ -994,10 +989,11 @@ When you use the ``%prec`` qualifier, you're telling SLY
 that you want the precedence of the expression to be the same as for
 this special marker instead of the usual precedence.
 
-It is also possible to specify non-associativity in the ``precedence`` table. This would
-be used when you *don't* want operations to chain together. For example, suppose
-you wanted to support comparison operators like ``<`` and ``>`` but you didn't want to allow
-combinations like ``a < b < c``. To do this, specify a rule like this::
+It is also possible to specify non-associativity in the ``precedence``
+table. This is used when you *don't* want operations to chain
+together. For example, suppose you wanted to support comparison
+operators like ``<`` and ``>`` but you didn't want combinations like
+``a < b < c``. To do this, specify the precedence rules like this::
 
     class MyParser(Parser):
         ...
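The hunk stops at the class body; the precedence table that follows in the docs is along these lines (a sketch; the token names are assumptions)::

    class MyParser(Parser):
        ...
        precedence = (
            ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),   # a < b < c becomes a syntax error
            ('left', 'PLUS', 'MINUS'),
            ('left', 'TIMES', 'DIVIDE'),
        )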
@@ -1140,8 +1136,9 @@ When a syntax error occurs, SLY performs the following steps:
 
 5. If a grammar rule accepts ``error`` as a token, it will be
    shifted onto the parsing stack.
 
-6. If the top item of the parsing stack is ``error``, lookahead tokens will be discarded until the
-   parser can successfully shift a new symbol or reduce a rule involving ``error``.
+6. If the top item of the parsing stack is ``error``, lookahead tokens
+   will be discarded until the parser can successfully shift a new
+   symbol or reduce a rule involving ``error``.
 
 Recovery and resynchronization with error rules
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1181,7 +1178,9 @@ appear as the last token on the right in an error rule. For example::
 
 This is because the first bad token encountered will cause the rule to
 be reduced--which may make it difficult to recover if more bad tokens
-immediately follow.
+immediately follow. It's better to have some kind of landmark such as
+a semicolon, closing parenthesis, or other token that can be used as
+a synchronization point.
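For instance, a rule that resynchronizes on a semicolon landmark (a sketch; the rule and token names are assumed)::

    @_('error SEMI')
    def statement(self, p):
        # Everything up to the ';' was bad; report it and continue
        print("Bad statement. Skipping to ';'")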
 Panic mode recovery
 ~~~~~~~~~~~~~~~~~~~
@@ -1208,7 +1207,7 @@ state::
             break
         self.restart()
 
-This function simply discards the bad token and tells the parser that
+This function discards the bad token and tells the parser that
 the error was ok::
 
     def error(self, p):
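The ``error()`` method is cut off by the hunk; the body the docs give continues roughly like this (a sketch using SLY's ``errok()``)::

    def error(self, p):
        if p:
            print('Syntax error at token', p.type)
            # Discard the token and tell the parser it's okay
            self.errok()
        else:
            print('Syntax error at EOF')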
@@ -1278,7 +1277,7 @@ Line Number and Position Tracking
 
 Position tracking is often a tricky problem when writing compilers.
 By default, SLY tracks the line number and position of all tokens.
-The following attributes may be useful in a production method:
+The following attributes may be useful in a production rule:
 
 - ``p.lineno``. Line number of the left-most terminal in a production.
 - ``p.index``. Lexing index of the left-most terminal in a production.
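For illustration (a sketch; the grammar rule is assumed), these attributes can be read inside any production method::

    @_('expr PLUS expr')
    def expr(self, p):
        # Line number of the left-most terminal in this production
        print('expression at line', p.lineno)
        return p.expr0 + p.expr1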
@@ -1301,9 +1300,9 @@ AST Construction
 SLY provides no special functions for constructing an abstract syntax
 tree. However, such construction is easy enough to do on your own.
 
-A minimal way to construct a tree is to simply create and
+A minimal way to construct a tree is to create and
 propagate a tuple or list in each grammar rule function. There
-are many possible ways to do this, but one example would be something
+are many possible ways to do this, but one example is something
 like this::
 
     @_('expr PLUS expr',
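The rule is cut off by the hunk; in the docs it continues roughly like this (a sketch)::

       'expr MINUS expr',
       'expr TIMES expr',
       'expr DIVIDE expr')
    def expr(self, p):
        return ('binop', p[1], p.expr0, p.expr1)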
@@ -1357,6 +1356,27 @@ The advantage to this approach is that it may make it easier to attach
 more complicated semantics, type checking, code generation, and other
 features to the node classes.
 
+Changing the starting symbol
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Normally, the first rule found in a parser class defines the starting
+grammar rule (top level rule). To change this, supply a ``start``
+specifier in your class. For example::
+
+    class CalcParser(Parser):
+        start = 'foo'
+
+        @_('A B')
+        def bar(self, p):
+            ...
+
+        @_('bar X')
+        def foo(self, p):      # Parsing starts here (start symbol above)
+            ...
+
+The use of a ``start`` specifier may be useful during debugging
+since you can use it to work with a subset of a larger grammar.
+
 Embedded Actions
 ^^^^^^^^^^^^^^^^
@@ -1454,9 +1474,3 @@ This might adjust internal symbol tables and other aspects of the
 parser. Upon completion of the rule ``statements``, code
 undoes the operations performed in the embedded action
 (e.g., ``pop_scope()``).
-
-
-
-
-
-
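A sketch of the kind of rule this passage describes (``push_scope``/``pop_scope`` come from the surrounding text; the grammar and the ``new_scope`` helper are assumptions)::

    @_('new_scope LBRACE statements RBRACE')
    def block(self, p):
        # On completion, undo what the embedded action did
        self.pop_scope()
        return p.statements

    @_('')
    def new_scope(self, p):
        # Embedded action: runs before 'statements' is parsed
        self.push_scope()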