SLY (Sly Lex Yacc)
==================

This document provides an overview of lexing and parsing with SLY.
Given the intrinsic complexity of parsing, I would strongly advise
that you read (or at least skim) this entire document before jumping
into a big development project with SLY.

SLY requires Python 3.5 or newer.  If you're using an older version,
you're out of luck. Sorry.

Introduction
------------

SLY is a library for writing parsers and compilers.  It is loosely
based on the traditional compiler construction tools lex and yacc
and implements the same LALR(1) parsing algorithm.  Most of the
features available in lex and yacc are also available in SLY.
It should also be noted that SLY does not provide much in
the way of bells and whistles (e.g., automatic construction of
abstract syntax trees, tree traversal, etc.).  Nor should you view it
as a parsing framework.  Instead, you will find a bare-bones, yet
fully capable library for writing parsers in Python.

The rest of this document assumes that you are somewhat familiar with
parsing theory, syntax directed translation, and the use of compiler
construction tools such as lex and yacc in other programming
languages.  If you are unfamiliar with these topics, you will probably
want to consult an introductory text such as "Compilers: Principles,
Techniques, and Tools", by Aho, Sethi, and Ullman.  O'Reilly's "Lex
and Yacc" by John Levine may also be handy.  In fact, the O'Reilly book
can be used as a reference for SLY as the concepts are virtually identical.

SLY Overview
------------

SLY provides two separate classes ``Lexer`` and ``Parser``.  The
``Lexer`` class is used to break input text into a collection of
tokens specified by a collection of regular expression rules.  The
``Parser`` class is used to recognize language syntax that has been
specified in the form of a context free grammar.  The two classes
are typically used together to make a parser.  However, this is not
a strict requirement--there is a great deal of flexibility allowed.
The next two parts describe the basics.

Writing a Lexer
---------------

Suppose you're writing a programming language and a user supplied the
following input string::

    x = 3 + 42 * (s - t)

The first step of any parsing is to break the text into tokens where
each token has a type and value.  For example, the above text might be
described by the following list of token tuples::

    [ ('ID','x'), ('EQUALS','='), ('NUMBER','3'),
      ('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
      ('LPAREN','('), ('ID','s'), ('MINUS','-'),
      ('ID','t'), ('RPAREN',')' ]

The SLY ``Lexer`` class is used to do this.  Here is a sample of a simple
lexer::

    # ------------------------------------------------------------
    # calclex.py
    #
    # Lexer for a simple expression evaluator for
    # numbers and +,-,*,/
    # ------------------------------------------------------------

    from sly import Lexer

    class CalcLexer(Lexer):
        # List of token names.  This is always required
        tokens = (
            'NUMBER',
            'PLUS',
            'MINUS',
            'TIMES',
            'DIVIDE',
            'LPAREN',
            'RPAREN',
        )

        # String containing ignored characters (spaces and tabs)
        ignore = ' \t'

        # Regular expression rules for simple tokens
        PLUS    = r'\+'
        MINUS   = r'-'
        TIMES   = r'\*'
        DIVIDE  = r'/'
        LPAREN  = r'\('
        RPAREN  = r'\)'

        # A regular expression rule with some action code
        @_(r'\d+')
        def NUMBER(self, t):
            t.value = int(t.value)
            return t

        # Define a rule so we can track line numbers
        @_(r'\n+')
        def newline(self, t):
            self.lineno += len(t.value)

        # Error handling rule (skips ahead one character)
        def error(self, value):
            print("Line %d: Illegal character '%s'" %
                  (self.lineno, value[0]))
            self.index += 1

    if __name__ == '__main__':
        data = '''
            3 + 4 * 10
            + -20 * ^ 2
        '''

        lexer = CalcLexer()
        for tok in lexer.tokenize(data):
            print(tok)

When executed, the example will produce the following output::

    Token(NUMBER, 3, 2, 14)
    Token(PLUS, '+', 2, 16)
    Token(NUMBER, 4, 2, 18)
    Token(TIMES, '*', 2, 20)
    Token(NUMBER, 10, 2, 22)
    Token(PLUS, '+', 3, 40)
    Token(MINUS, '-', 3, 42)
    Token(NUMBER, 20, 3, 43)
    Token(TIMES, '*', 3, 46)
    Line 3: Illegal character '^'
    Token(NUMBER, 2, 3, 50)

A lexer only has one public method ``tokenize()``.  This is a generator
function that produces a stream of ``Token`` instances.
The ``type`` and ``value`` attributes of ``Token`` contain the
token type name and value respectively.  The ``lineno`` and ``index``
attributes contain the line number and position in the input text
where the token appears.
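
As an illustration, here is a small sketch (using the ``CalcLexer``
defined above) that drives the token stream by hand and reads those
attributes; the special handling of ``PLUS`` tokens is purely
hypothetical::

    lexer = CalcLexer()
    for tok in lexer.tokenize('3 + 4 * 10'):
        # Each tok is a Token with type, value, lineno, and index attributes
        if tok.type == 'PLUS':
            print('Found a plus sign at index', tok.index)
        print(tok.type, tok.value, tok.lineno, tok.index)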

The tokens list
^^^^^^^^^^^^^^^

Lexers must specify a ``tokens`` attribute that defines all of the possible token
type names that can be produced by the lexer.  This list is always required
and is used to perform a variety of validation checks.

In the example, the following code specified the token names::

    class CalcLexer(Lexer):
        ...
        # List of token names.  This is always required
        tokens = (
            'NUMBER',
            'PLUS',
            'MINUS',
            'TIMES',
            'DIVIDE',
            'LPAREN',
            'RPAREN',
        )
        ...

Specification of tokens
^^^^^^^^^^^^^^^^^^^^^^^

Tokens are specified by writing a regular expression rule compatible
with Python's ``re`` module.  This is done by writing definitions that
match one of the names of the tokens provided in the ``tokens``
attribute.  For example::

    PLUS = r'\+'
    MINUS = r'-'

Sometimes you want to perform an action when a token is matched.  For example,
maybe you want to convert a numeric value or look up a symbol.  To do
this, write your action as a method and give the associated regular
expression using the ``@_()`` decorator like this::

    @_(r'\d+')
    def NUMBER(self, t):
        t.value = int(t.value)
        return t

The method always takes a single argument which is an instance of
``Token``.  By default, ``t.type`` is set to the name of the token
(e.g., ``'NUMBER'``).  The function can change the token type and
value as it sees appropriate.  When finished, the resulting token
object should be returned as a result.  If no value is returned by the
function, the token is simply discarded and the next token read.

Internally, the ``Lexer`` class uses the ``re`` module to do its
pattern matching.  Patterns are compiled using the ``re.VERBOSE`` flag
which can be used to help readability.  However, be aware that
unescaped whitespace is ignored and comments are allowed in this mode.
If your pattern involves whitespace, make sure you use ``\s``.  If you
need to match the ``#`` character, use ``[#]``.
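
For example, because of ``re.VERBOSE``, a pattern written with a bare
space or a bare ``#`` will not behave as you might expect.  The
following sketch shows the workaround; the ``DIRECTIVE`` token matching
text such as ``#define NAME`` is purely hypothetical::

    # Wrong: in VERBOSE mode the space is ignored and everything after
    # the '#' is treated as a regex comment:
    #     DIRECTIVE = r'# *define [A-Za-z_][A-Za-z_0-9]*'
    #
    # Right: use [#] to match the '#' character and \s for whitespace
    @_(r'[#]\s*define\s+[A-Za-z_][A-Za-z_0-9]*')
    def DIRECTIVE(self, t):
        return t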

Controlling Match Order
^^^^^^^^^^^^^^^^^^^^^^^

Tokens are matched in the same order as patterns are listed in the
``Lexer`` class.  Be aware that longer tokens may need to be specified
before short tokens.  For example, if you wanted to have separate
tokens for "=" and "==", you need to make sure that "==" is listed
first.  For example::

    class MyLexer(Lexer):
        tokens = ('ASSIGN', 'EQUALTO', ...)
        ...
        EQUALTO = r'=='       # MUST APPEAR FIRST!
        ASSIGN = r'='

To handle reserved words, you should write a single rule to match an
identifier and do a special name lookup in a function like this::

    class MyLexer(Lexer):

        reserved = { 'if', 'then', 'else', 'while' }
        tokens = ['LPAREN', 'RPAREN', ..., 'ID'] + [ w.upper() for w in reserved ]

        @_(r'[a-zA-Z_][a-zA-Z_0-9]*')
        def ID(self, t):
            # Check to see if the name is a reserved word
            # If so, change its type.
            if t.value in self.reserved:
                t.type = t.value.upper()
            return t

Note: You should avoid writing individual rules for reserved words.
For example, suppose you wrote rules like this::

    FOR = r'for'
    PRINT = r'print'

In this case, the rules will be triggered for identifiers that include
those words as a prefix such as "forget" or "printed".  This is
probably not what you want.

Discarded text
^^^^^^^^^^^^^^

To discard text, such as a comment, simply define a token rule that
returns no value.  For example::

    @_(r'\#.*')
    def COMMENT(self, t):
        pass
        # No return value. Token discarded

Alternatively, you can include the prefix ``ignore_`` in a token
declaration to force a token to be ignored.  For example::

    ignore_COMMENT = r'\#.*'

Line numbers and position tracking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, lexers know nothing about line numbers.  This is because
they don't know anything about what constitutes a "line" of input
(e.g., the newline character or even if the input is textual data).
To update this information, you need to write a special rule.  In the
example, the ``newline()`` rule shows how to do this::

    # Define a rule so we can track line numbers
    @_(r'\n+')
    def newline(self, t):
        self.lineno += len(t.value)

Within the rule, the ``lineno`` attribute of the lexer is updated.  After
the line number is updated, the token is simply discarded since
nothing is returned.

Lexers do not perform any kind of automatic column tracking.  However,
they do record positional information related to each token in the
``index`` attribute.  Using this, it is usually possible to compute
column information as a separate step.  For instance, you could count
backwards until you reach the previous newline::

    # Compute column.
    #     input is the input text string
    #     token is a token instance
    def find_column(text, token):
        last_cr = text.rfind('\n', 0, token.index)
        if last_cr < 0:
            last_cr = 0
        column = (token.index - last_cr) + 1
        return column

Since column information is often only useful in the context of error
handling, calculating the column position can be performed when needed
as opposed to including it on each token.
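
For instance, here is a small sketch that uses ``find_column()`` while
iterating over the token stream to report the position of each token
(it assumes the original input text is still available in a ``data``
variable)::

    lexer = CalcLexer()
    data = 'x = 3 +\n  42 * (s - t)'
    for tok in lexer.tokenize(data):
        # Column computed on demand from the token's index attribute
        print(tok.type, 'at line', tok.lineno, 'column', find_column(data, tok))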

Ignored characters
^^^^^^^^^^^^^^^^^^

The special ``ignore`` specification is reserved for characters that
should be completely ignored in the input stream.  Usually this is
used to skip over whitespace and other non-essential characters.
Although it is possible to define a regular expression rule for
whitespace in a manner similar to ``newline()``, the use of ``ignore``
provides substantially better lexing performance because it is handled
as a special case and is checked in a much more efficient manner than
the normal regular expression rules.

The characters given in ``ignore`` are not ignored when such
characters are part of other regular expression patterns.  For
example, if you had a rule to capture quoted text, that pattern can
include the ignored characters (which will be captured in the normal
way).  The main purpose of ``ignore`` is to ignore whitespace and
other padding between the tokens that you actually want to parse.
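
As a small sketch of that distinction, the lexer below skips spaces and
tabs between tokens, yet a (hypothetical) ``STRING`` token still
captures any spaces that appear inside the quotes because they are part
of its pattern::

    class MyLexer(Lexer):
        tokens = ('STRING', ...)

        # Spaces and tabs between tokens are skipped entirely
        ignore = ' \t'

        # But spaces inside quoted text are matched and kept
        STRING = r'"[^"]*"'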

Literal characters
^^^^^^^^^^^^^^^^^^

Literal characters can be specified by defining a variable
``literals`` in the class.  For example::

    class MyLexer(Lexer):
        ...
        literals = [ '+','-','*','/' ]
        ...

A literal character is simply a single character that is returned "as
is" when encountered by the lexer.  Literals are checked after all of
the defined regular expression rules.  Thus, if a rule starts with one
of the literal characters, it will always take precedence.

When a literal token is returned, both its ``type`` and ``value``
attributes are set to the character itself.  For example, ``'+'``.

It's possible to write token methods that perform additional actions
when literals are matched.  However, you'll need to set the token type
appropriately.  For example::

    class MyLexer(Lexer):

        literals = [ '{', '}' ]

        def __init__(self):
            self.indentation_level = 0

        @_(r'\{')
        def lbrace(self, t):
            t.type = '{'      # Set token type to the expected literal
            self.indentation_level += 1
            return t

        @_(r'\}')
        def rbrace(self, t):
            t.type = '}'      # Set token type to the expected literal
            self.indentation_level -= 1
            return t

Error handling
^^^^^^^^^^^^^^

The ``error()`` method is used to handle lexing errors that occur when
illegal characters are detected.  The error method receives a string
containing all remaining untokenized text.  A typical handler might
look at this text and skip ahead in some manner.  For example::

    class MyLexer(Lexer):
        ...
        # Error handling rule
        def error(self, value):
            print("Illegal character '%s'" % value[0])
            self.index += 1

In this case, we simply print the offending character and skip ahead
one character by updating the lexer position.  Error handling in a
parser is often a hard problem.  An error handler might scan ahead
to a logical synchronization point such as a semicolon, a blank line,
or similar landmark.

Maintaining extra state
^^^^^^^^^^^^^^^^^^^^^^^

In your lexer, you may want to maintain a variety of other state
information.  This might include mode settings, symbol tables, and
other details.  As an example, suppose that you wanted to keep track
of how many NUMBER tokens had been encountered.  You can do this by
adding an ``__init__()`` method and adding more attributes.  For
example::

    class MyLexer(Lexer):
        def __init__(self):
            self.num_count = 0

        @_(r'\d+')
        def NUMBER(self, t):
            self.num_count += 1
            t.value = int(t.value)
            return t

Please note that lexers already use the ``lineno`` and ``index``
attributes during parsing.

Writing a Parser
----------------

The ``Parser`` class is used to parse language syntax.  Before showing
an example, there are a few important bits of background that must be
mentioned.

Parsing Background
^^^^^^^^^^^^^^^^^^

When writing a parser, *syntax* is usually specified in terms of a BNF
grammar.  For example, if you wanted to parse simple arithmetic
expressions, you might first write an unambiguous grammar
specification like this::

    expr       : expr + term
               | expr - term
               | term

    term       : term * factor
               | term / factor
               | factor

    factor     : NUMBER
               | ( expr )

In the grammar, symbols such as ``NUMBER``, ``+``, ``-``, ``*``, and
``/`` are known as *terminals* and correspond to raw input tokens.
Identifiers such as ``term`` and ``factor`` refer to grammar rules
comprised of a collection of terminals and other rules.  These
identifiers are known as *non-terminals*.

The semantic behavior of a language is often specified using a
technique known as syntax directed translation.  In syntax directed
translation, values are attached to each symbol in a given grammar
rule along with an action.  Whenever a particular grammar rule is
recognized, the action describes what to do.  For example, given the
expression grammar above, you might write the specification for a
simple calculator like this::

    Grammar                   Action
    ------------------------  --------------------------------
    expr0 : expr1 + term      expr0.val = expr1.val + term.val
          | expr1 - term      expr0.val = expr1.val - term.val
          | term              expr0.val = term.val

    term0 : term1 * factor    term0.val = term1.val * factor.val
          | term1 / factor    term0.val = term1.val / factor.val
          | factor            term0.val = factor.val

    factor : NUMBER           factor.val = int(NUMBER.val)
           | ( expr )         factor.val = expr.val

A good way to think about syntax directed translation is to view each
symbol in the grammar as a kind of object.  Associated with each symbol
is a value representing its "state" (for example, the ``val``
attribute above).  Semantic actions are then expressed as a collection
of functions or methods that operate on the symbols and associated
values.

SLY uses a parsing technique known as LR-parsing or shift-reduce
parsing.  LR parsing is a bottom up technique that tries to recognize
the right-hand-side of various grammar rules.  Whenever a valid
right-hand-side is found in the input, the appropriate action method
is triggered and the grammar symbols on the right-hand-side are replaced
by the grammar symbol on the left-hand-side.

LR parsing is commonly implemented by shifting grammar symbols onto a
stack and looking at the stack and the next input token for patterns
that match one of the grammar rules.  The details of the algorithm can
be found in a compiler textbook, but the following example illustrates
the steps that are performed when parsing the expression ``3 + 5 * (10
- 20)`` using the grammar defined above.  In the example, the special
symbol ``$`` represents the end of input::

    Step Symbol Stack                    Input Tokens            Action
    ---- ------------------------------  ---------------------   -----------------------------
    1                                    3 + 5 * ( 10 - 20 )$    Shift 3
    2    3                               + 5 * ( 10 - 20 )$      Reduce factor : NUMBER
    3    factor                          + 5 * ( 10 - 20 )$      Reduce term : factor
    4    term                            + 5 * ( 10 - 20 )$      Reduce expr : term
    5    expr                            + 5 * ( 10 - 20 )$      Shift +
    6    expr +                          5 * ( 10 - 20 )$        Shift 5
    7    expr + 5                        * ( 10 - 20 )$          Reduce factor : NUMBER
    8    expr + factor                   * ( 10 - 20 )$          Reduce term : factor
    9    expr + term                     * ( 10 - 20 )$          Shift *
    10   expr + term *                   ( 10 - 20 )$            Shift (
    11   expr + term * (                 10 - 20 )$              Shift 10
    12   expr + term * ( 10              - 20 )$                 Reduce factor : NUMBER
    13   expr + term * ( factor          - 20 )$                 Reduce term : factor
    14   expr + term * ( term            - 20 )$                 Reduce expr : term
    15   expr + term * ( expr            - 20 )$                 Shift -
    16   expr + term * ( expr -          20 )$                   Shift 20
    17   expr + term * ( expr - 20       )$                      Reduce factor : NUMBER
    18   expr + term * ( expr - factor   )$                      Reduce term : factor
    19   expr + term * ( expr - term     )$                      Reduce expr : expr - term
    20   expr + term * ( expr            )$                      Shift )
    21   expr + term * ( expr )          $                       Reduce factor : (expr)
    22   expr + term * factor            $                       Reduce term : term * factor
    23   expr + term                     $                       Reduce expr : expr + term
    24   expr                            $                       Reduce expr
    25                                   $                       Success!

When parsing the expression, an underlying state machine and the
current input token determine what happens next.  If the next token
looks like part of a valid grammar rule (based on other items on the
stack), it is generally shifted onto the stack.  If the top of the
stack contains a valid right-hand-side of a grammar rule, it is
usually "reduced" and the symbols replaced with the symbol on the
left-hand-side.  When this reduction occurs, the appropriate action is
triggered (if defined).  If the input token can't be shifted and the
top of stack doesn't match any grammar rules, a syntax error has
occurred and the parser must take some kind of recovery step (or bail
out).  A parse is only successful if the parser reaches a state where
the symbol stack is empty and there are no more input tokens.

It is important to note that the underlying implementation is built
around a large finite-state machine that is encoded in a collection of
tables.  The construction of these tables is non-trivial and
beyond the scope of this discussion.  However, subtle details of this
process explain why, in the example above, the parser chooses to shift
a token onto the stack in step 9 rather than reducing the
rule ``expr : expr + term``.

Parsing Example
^^^^^^^^^^^^^^^

Suppose you wanted to make a grammar for evaluating simple arithmetic
expressions as previously described.  Here is how you would do it with
SLY::

    from sly import Parser
    from calclex import CalcLexer

    class CalcParser(Parser):
        # Get the token list from the lexer (required)
        tokens = CalcLexer.tokens

        # Grammar rules and actions
        @_('expr PLUS term')
        def expr(self, p):
            return p[0] + p[2]

        @_('expr MINUS term')
        def expr(self, p):
            return p[0] - p[2]

        @_('term')
        def expr(self, p):
            return p[0]

        @_('term TIMES factor')
        def term(self, p):
            return p[0] * p[2]

        @_('term DIVIDE factor')
        def term(self, p):
            return p[0] / p[2]

        @_('factor')
        def term(self, p):
            return p[0]

        @_('NUMBER')
        def factor(self, p):
            return p[0]

        @_('LPAREN expr RPAREN')
        def factor(self, p):
            return p[1]

        # Error rule for syntax errors
        def error(self, p):
            print("Syntax error in input!")

    if __name__ == '__main__':
        lexer = CalcLexer()
        parser = CalcParser()

        while True:
            try:
                text = input('calc > ')
                result = parser.parse(lexer.tokenize(text))
                print(result)
            except EOFError:
                break

In this example, each grammar rule is defined by a method that's been
decorated with the ``@_(rule)`` decorator.  The very first grammar rule
defines the top of the parse.  The name of each method should match
the name of the grammar rule being parsed.  The argument to the
``@_()`` decorator is a string describing the right-hand-side of the
grammar.  Thus, a grammar rule like this::

    expr : expr PLUS term

becomes a method like this::

    @_('expr PLUS term')
    def expr(self, p):
        ...

The method is triggered when that grammar rule is recognized on the
input.  As an argument, the method receives a sequence of grammar symbol
values ``p`` that is accessed as an array.  The mapping between
elements of ``p`` and the grammar rule is as shown here::

    #   p[0] p[1] p[2]
    #    |    |    |
    @_('expr PLUS term')
    def expr(self, p):
        ...

For tokens, the value of the corresponding ``p[i]`` is the *same* as
the ``value`` attribute assigned to the token in the lexer module.  For
non-terminals, the value is whatever was returned by the methods
defined for that rule.

Within each rule, you return a value that becomes associated with that
grammar symbol elsewhere.  In the example shown, rules are carrying out
the evaluation of an arithmetic expression::

    @_('expr PLUS term')
    def expr(self, p):
        return p[0] + p[2]

There are many other kinds of things that might happen in a rule
though.  For example, a rule might construct part of a parse tree
instead::

    @_('expr PLUS term')
    def expr(self, p):
        return ('+', p[0], p[2])

or perhaps create an instance related to an abstract syntax tree::

    class BinOp(object):
        def __init__(self, op, left, right):
            self.op = op
            self.left = left
            self.right = right

    @_('expr PLUS term')
    def expr(self, p):
        return BinOp('+', p[0], p[2])

The key thing is that the method returns the value that's going to
be attached to the symbol "expr" in this case.

The ``error()`` method is defined to handle syntax errors (if any).
See the error handling section below for more detail.

Combining Grammar Rule Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When grammar rules are similar, they can be combined into a single method.
For example, suppose you had two rules that were constructing a parse tree::

    @_('expr PLUS term')
    def expr(self, p):
        return ('+', p[0], p[2])

    @_('expr MINUS term')
    def expr(self, p):
        return ('-', p[0], p[2])

Instead of writing two functions, you might write a single function like this::

    @_('expr PLUS term',
       'expr MINUS term')
    def expr(self, p):
        return (p[1], p[0], p[2])

In general, the ``@_()`` decorator for any given method can list
multiple grammar rules.  When combining grammar rules into a single
function though, it is usually a good idea for all of the rules to
have a similar structure (e.g., the same number of terms).  Otherwise,
the corresponding action code may end up being more complicated than
necessary.
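
To illustrate that caveat, here is a sketch of what combining rules
with *different* shapes can look like.  It assumes the production
object ``p`` supports ``len()``; the extra length test is exactly the
kind of clutter that keeping the rules separate avoids::

    # A sketch only: grouping a one-symbol rule and a three-symbol rule
    @_('term',
       'expr PLUS term')
    def expr(self, p):
        if len(p) == 1:
            return p[0]            # just a term
        else:
            return p[0] + p[2]     # expr PLUS term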

Character Literals
^^^^^^^^^^^^^^^^^^

If desired, a grammar may contain tokens defined as single character
literals.  For example::

    @_('expr "+" term')
    def expr(self, p):
        return p[0] + p[2]

    @_('expr "-" term')
    def expr(self, p):
        return p[0] - p[2]

A character literal must be enclosed in quotes such as ``"+"``.  In
addition, if literals are used, they must be declared in the
corresponding lexer class through the use of a special ``literals``
declaration::

    class CalcLexer(Lexer):
        ...
        literals = [ '+','-','*','/' ]
        ...

Character literals are limited to a single character.  Thus, it is not
legal to specify literals such as ``<=`` or ``==``.  For this, use the
normal lexing rules (e.g., define a rule such as ``EQ = r'=='``).
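
For instance, a two-character operator would be written as an ordinary
token in the lexer and referenced by name in the parser; the ``EQ``
token and ``comparison`` rule below are only a sketch::

    class MyLexer(Lexer):
        tokens = ('EQ', ...)
        EQ = r'=='

    class MyParser(Parser):
        tokens = MyLexer.tokens

        @_('expr EQ expr')
        def comparison(self, p):
            return p[0] == p[2]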

Empty Productions
^^^^^^^^^^^^^^^^^

If you need an empty production, define a special rule like this::

    @_('')
    def empty(self, p):
        pass

Now to use the empty production elsewhere, use the name 'empty' as a
symbol.  For example::

    @_('item')
    def optitem(self, p):
        ...

    @_('empty')
    def optitem(self, p):
        ...

Note: You can write empty rules anywhere by simply specifying an empty
string.  However, writing an "empty" rule and using "empty" to denote an
empty production may be easier to read and more clearly state your
intention.

Changing the starting symbol
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Normally, the first rule found in a parser class defines the starting
grammar rule (top level rule).  To change this, supply a ``start``
specifier in your class.  For example::

    class CalcParser(Parser):
        start = 'foo'

        @_('A B')
        def bar(self, p):
            ...

        @_('bar X')
        def foo(self, p):     # Parsing starts here (start symbol above)
            ...

The use of a ``start`` specifier may be useful during debugging
since you can use it to work with a subset of a larger grammar.

Dealing With Ambiguous Grammars
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The expression grammar given in the earlier example has been written
in a special format to eliminate ambiguity.  However, in many
situations, it is extremely difficult or awkward to write grammars in
this format.  A much more natural way to express the grammar is in a
more compact form like this::

    expr : expr PLUS expr
         | expr MINUS expr
         | expr TIMES expr
         | expr DIVIDE expr
         | LPAREN expr RPAREN
         | NUMBER

Unfortunately, this grammar specification is ambiguous.  For example,
if you are parsing the string "3 * 4 + 5", there is no way to tell how
the operators are supposed to be grouped.  For example, does the
expression mean "(3 * 4) + 5" or is it "3 * (4+5)"?

When an ambiguous grammar is given, you will get messages about
"shift/reduce conflicts" or "reduce/reduce conflicts".  A shift/reduce
conflict is caused when the parser generator can't decide whether or
not to reduce a rule or shift a symbol on the parsing stack.  For
example, consider the string "3 * 4 + 5" and the internal parsing
stack::

    Step Symbol Stack           Input Tokens            Action
    ---- ---------------------  ---------------------   -------------------------------
    1    $                      3 * 4 + 5$              Shift 3
    2    $ 3                    * 4 + 5$                Reduce : expr : NUMBER
    3    $ expr                 * 4 + 5$                Shift *
    4    $ expr *               4 + 5$                  Shift 4
    5    $ expr * 4             + 5$                    Reduce: expr : NUMBER
    6    $ expr * expr          + 5$                    SHIFT/REDUCE CONFLICT ????

In this case, when the parser reaches step 6, it has two options.  One
is to reduce the rule ``expr : expr * expr`` on the stack.  The other
option is to shift the token ``+`` on the stack.  Both options are
perfectly legal from the rules of the context-free-grammar.

By default, all shift/reduce conflicts are resolved in favor of
shifting.  Therefore, in the above example, the parser will always
shift the ``+`` instead of reducing.  Although this strategy works in
many cases (for example, the case of "if-then" versus "if-then-else"),
it is not enough for arithmetic expressions.  In fact, in the above
example, the decision to shift ``+`` is completely wrong---we should
have reduced ``expr * expr`` since multiplication has higher
mathematical precedence than addition.

To resolve ambiguity, especially in expression grammars, SLY allows
individual tokens to be assigned a precedence level and associativity.
This is done by adding a variable ``precedence`` to the parser class
like this::

    class CalcParser(Parser):
        ...
        precedence = (
            ('left', 'PLUS', 'MINUS'),
            ('left', 'TIMES', 'DIVIDE'),
        )
        ...

This declaration specifies that ``PLUS``/``MINUS`` have the
same precedence level and are left-associative and that
``TIMES``/``DIVIDE`` have the same precedence and are
left-associative.  Within the ``precedence`` declaration, tokens
are ordered from lowest to highest precedence.  Thus, this declaration
specifies that ``TIMES``/``DIVIDE`` have higher precedence
than ``PLUS``/``MINUS`` (since they appear later in the
precedence specification).

The precedence specification works by associating a numerical
precedence level value and associativity direction to the listed
tokens.  For example, in the above example you get::

    PLUS      : level = 1,  assoc = 'left'
    MINUS     : level = 1,  assoc = 'left'
    TIMES     : level = 2,  assoc = 'left'
    DIVIDE    : level = 2,  assoc = 'left'

These values are then used to attach a numerical precedence value and
associativity direction to each grammar rule.  *This is always
determined by looking at the precedence of the right-most terminal
symbol.*  For example::

    expr : expr PLUS expr           # level = 1, left
         | expr MINUS expr          # level = 1, left
         | expr TIMES expr          # level = 2, left
         | expr DIVIDE expr         # level = 2, left
         | LPAREN expr RPAREN       # level = None (not specified)
         | NUMBER                   # level = None (not specified)

When shift/reduce conflicts are encountered, the parser generator
resolves the conflict by looking at the precedence rules and
associativity specifiers.

1. If the current token has higher precedence than the rule on the stack,
   it is shifted.

2. If the grammar rule on the stack has higher precedence, the rule is
   reduced.

3. If the current token and the grammar rule have the same precedence, the
   rule is reduced for left associativity, whereas the token is shifted for
   right associativity.

4. If nothing is known about the precedence, shift/reduce conflicts are
   resolved in favor of shifting (the default).

For example, if ``expr PLUS expr`` has been parsed and the
next token is ``TIMES``, the action is going to be a shift because
``TIMES`` has a higher precedence level than ``PLUS``.  On the other hand,
if ``expr TIMES expr`` has been parsed and the next token is
``PLUS``, the action is going to be reduce because ``PLUS`` has a lower
precedence than ``TIMES``.

When shift/reduce conflicts are resolved using the first three
techniques (with the help of precedence rules), SLY will
report no errors or conflicts in the grammar.

One problem with the precedence specifier technique is that it is
sometimes necessary to change the precedence of an operator in certain
contexts.  For example, consider a unary-minus operator in ``3 + 4 *
-5``.  Mathematically, the unary minus is normally given a very high
precedence--being evaluated before the multiply.  However, in our
precedence specifier, ``MINUS`` has a lower precedence than ``TIMES``.  To
deal with this, precedence rules can be given for so-called "fictitious tokens"
like this::

    class CalcParser(Parser):
        ...
        precedence = (
            ('left', 'PLUS', 'MINUS'),
            ('left', 'TIMES', 'DIVIDE'),
            ('right', 'UMINUS'),            # Unary minus operator
        )

Now, in the grammar file, you write the unary minus rule like this::

    @_('MINUS expr %prec UMINUS')
    def expr(self, p):
        return -p[1]

In this case, ``%prec UMINUS`` overrides the default rule precedence--setting it to that
of ``UMINUS`` in the precedence specifier.

At first, the use of ``UMINUS`` in this example may appear very confusing.
``UMINUS`` is not an input token or a grammar rule.  Instead, you should
think of it as the name of a special marker in the precedence table.
When you use the ``%prec`` qualifier, you're telling SLY
that you want the precedence of the expression to be the same as for
this special marker instead of the usual precedence.

It is also possible to specify non-associativity in the ``precedence`` table.  This would
be used when you *don't* want operations to chain together.  For example, suppose
you wanted to support comparison operators like ``<`` and ``>`` but you didn't want to allow
combinations like ``a < b < c``.  To do this, specify a rule like this::

    class MyParser(Parser):
        ...
        precedence = (
            ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
            ('left', 'PLUS', 'MINUS'),
            ('left', 'TIMES', 'DIVIDE'),
            ('right', 'UMINUS'),            # Unary minus operator
        )

If you do this, the occurrence of input text such as ``a < b < c``
will result in a syntax error.  However, simple expressions such as
``a < b`` will still be fine.

Reduce/reduce conflicts are caused when there are multiple grammar
rules that can be applied to a given set of symbols.  This kind of
conflict is almost always bad and is always resolved by picking the
rule that appears first in the grammar file.  Reduce/reduce conflicts
are almost always caused when different sets of grammar rules somehow
generate the same set of symbols.  For example::

    assignment : ID EQUALS NUMBER
               | ID EQUALS expr

    expr       : expr PLUS expr
               | expr MINUS expr
               | expr TIMES expr
               | expr DIVIDE expr
               | LPAREN expr RPAREN
               | NUMBER

In this case, a reduce/reduce conflict exists between these two rules::

    assignment  : ID EQUALS NUMBER
    expr        : NUMBER

For example, if you're parsing ``a = 5``, the parser can't figure out if this
is supposed to be reduced as ``assignment : ID EQUALS NUMBER`` or
whether it's supposed to reduce the 5 as an expression and then reduce
the rule ``assignment : ID EQUALS expr``.

It should be noted that reduce/reduce conflicts are notoriously
difficult to spot simply looking at the input grammar.  When a
reduce/reduce conflict occurs, SLY will try to help by
printing a warning message such as this::

    WARNING: 1 reduce/reduce conflict
    WARNING: reduce/reduce conflict in state 15 resolved using rule (assignment -> ID EQUALS NUMBER)
    WARNING: rejected rule (expression -> NUMBER)

This message identifies the two rules that are in conflict.  However,
it may not tell you how the parser arrived at such a state.  To try
and figure it out, you'll probably have to look at your grammar and
the contents of the parser debugging file with an appropriately high
level of caffeination (see the next section).
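
For this particular grammar, one common fix (shown here only as a
sketch) is to drop the redundant ``ID EQUALS NUMBER`` rule entirely,
since a ``NUMBER`` is already reachable through ``expr``::

    # A sketch of the conflict-free version of the assignment rule
    @_('ID EQUALS expr')
    def assignment(self, p):
        ...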

Parser Debugging
^^^^^^^^^^^^^^^^

Tracking down shift/reduce and reduce/reduce conflicts is one of the
finer pleasures of using an LR parsing algorithm.  To assist in
debugging, you can have SLY produce a debugging file when it
constructs the parsing tables.  Add a ``debugfile`` attribute to your
class like this::

    class CalcParser(Parser):
        debugfile = 'parser.out'
        ...

When present, this will write the entire grammar along with all parsing
states to the file you specify.  Each state of the parser is shown
as output that looks something like this::

    state 2

        (7) factor -> LPAREN . expr RPAREN
        (1) expr -> . term
        (2) expr -> . expr MINUS term
        (3) expr -> . expr PLUS term
        (4) term -> . factor
        (5) term -> . term DIVIDE factor
        (6) term -> . term TIMES factor
        (7) factor -> . LPAREN expr RPAREN
        (8) factor -> . NUMBER

        LPAREN          shift and go to state 2
        NUMBER          shift and go to state 3

        factor          shift and go to state 1
        term            shift and go to state 4
        expr            shift and go to state 6

Each state keeps track of the grammar rules that might be in the
process of being matched at that point.  Within each rule, the "."
character indicates the current location of the parse within that
rule.  In addition, the actions for each valid input token are listed.
By looking at these rules (and with a little practice), you can
usually track down the source of most parsing conflicts.  It should
also be stressed that not all shift-reduce conflicts are bad.
However, the only way to be sure that they are resolved correctly is
to look at the debugging file.

Syntax Error Handling
^^^^^^^^^^^^^^^^^^^^^

If you are creating a parser for production use, the handling of
syntax errors is important.  As a general rule, you don't want a
parser to simply throw up its hands and stop at the first sign of
trouble.  Instead, you want it to report the error, recover if
possible, and continue parsing so that all of the errors in the input
get reported to the user at once.  This is the standard behavior found
in compilers for languages such as C, C++, and Java.

In SLY, when a syntax error occurs during parsing, the error is immediately
detected (i.e., the parser does not read any more tokens beyond the
source of the error).  However, at this point, the parser enters a
recovery mode that can be used to try and continue further parsing.
As a general rule, error recovery in LR parsers is a delicate
topic that involves ancient rituals and black-magic.  The recovery mechanism
provided by SLY is comparable to Unix yacc so you may want to
consult a book like O'Reilly's "Lex and Yacc" for some of the finer details.

When a syntax error occurs, SLY performs the following steps:

1. On the first occurrence of an error, the user-defined ``error()``
   method is called with the offending token as an argument.  However, if
   the syntax error is due to reaching the end-of-file, an argument of
   ``None`` is passed.  Afterwards, the parser enters an "error-recovery"
   mode in which it will not make future calls to ``error()`` until it
   has successfully shifted at least 3 tokens onto the parsing stack.

2. If no recovery action is taken in ``error()``, the offending
   lookahead token is replaced with a special ``error`` token.

3. If the offending lookahead token is already set to ``error``,
   the top item of the parsing stack is deleted.

4. If the entire parsing stack is unwound, the parser enters a restart
   state and attempts to start parsing from its initial state.

5. If a grammar rule accepts ``error`` as a token, it will be
   shifted onto the parsing stack.

6. If the top item of the parsing stack is ``error``, lookahead tokens will
   be discarded until the parser can successfully shift a new symbol or
   reduce a rule involving ``error``.
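
As a minimal illustration of step 1, an ``error()`` method that only
reports the problem (without attempting any recovery) might look like
the following sketch; the two recovery styles described in the next
sections build on this::

    class CalcParser(Parser):
        ...
        def error(self, p):
            if p:
                print("Syntax error at token", p.type, "on line", p.lineno)
            else:
                print("Syntax error at end of input")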

Recovery and resynchronization with error rules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The most well-behaved approach for handling syntax errors is to write
grammar rules that include the ``error`` token.  For example,
suppose your language had a grammar rule for a print statement like
this::

    @_('PRINT expr SEMI')
    def statement(self, p):
        ...

To account for the possibility of a bad expression, you might write an
additional grammar rule like this::

    @_('PRINT error SEMI')
    def statement(self, p):
        print("Syntax error in print statement. Bad expression")

In this case, the ``error`` token will match any sequence of
tokens that might appear up to the first semicolon that is
encountered.  Once the semicolon is reached, the rule will be
invoked and the ``error`` token will go away.

This type of recovery is sometimes known as parser resynchronization.
The ``error`` token acts as a wildcard for any bad input text and
the token immediately following ``error`` acts as a
synchronization token.

It is important to note that the ``error`` token usually does not
appear as the last token on the right in an error rule.  For example::

    @_('PRINT error')
    def statement(self, p):
        print("Syntax error in print statement. Bad expression")

This is because the first bad token encountered will cause the rule to
be reduced--which may make it difficult to recover if more bad tokens
immediately follow.

Panic mode recovery
~~~~~~~~~~~~~~~~~~~

An alternative error recovery scheme is to enter a panic mode recovery
in which tokens are discarded to a point where the parser might be
able to recover in some sensible manner.

Panic mode recovery is implemented entirely in the ``error()``
function.  For example, this function starts discarding tokens until
it reaches a closing '}'.  Then, it restarts the parser in its initial
state::

    def error(self, p):
        print("Whoa. You are seriously hosed.")
        if not p:
            print("End of File!")
            return

        # Read ahead looking for a closing '}'
        while True:
            tok = next(self.tokens, None)
            if not tok or tok.type == 'RBRACE':
                break
        self.restart()

This function simply discards the bad token and tells the parser that
the error was ok::

    def error(self, p):
        if p:
            print("Syntax error at token", p.type)
            # Just discard the token and tell the parser it's okay.
            self.errok()
        else:
            print("Syntax error at EOF")

A few additional details about some of the attributes and methods being used:

- ``self.errok()``.  This resets the parser state so it doesn't think
  it's in error-recovery mode.  This will prevent an ``error`` token
  from being generated and will reset the internal error counters so
  that the next syntax error will call ``error()`` again.

- ``self.tokens``.  This is the iterable sequence of tokens being parsed.
  Calling ``next(self.tokens)`` will force it to advance by one token.

- ``self.restart()``.  This discards the entire parsing stack and
  resets the parser to its initial state.

To supply the next lookahead token to the parser, ``error()`` can return a token.  This might be
useful if trying to synchronize on special characters.  For example::

    def error(self, tok):
        # Read ahead looking for a terminating ";"
        while True:
            tok = next(self.tokens, None)           # Get the next token
            if not tok or tok.type == 'SEMI':
                break
        self.errok()

        # Return SEMI to the parser as the next lookahead token
        return tok

When Do Syntax Errors Get Reported?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In most cases, SLY will handle errors as soon as a bad input token is
detected on the input.  However, be aware that SLY may choose to delay
error handling until after it has reduced one or more grammar rules
first.  This behavior might be unexpected, but it's related to special
states in the underlying parsing table known as "defaulted states."  A
defaulted state is a parsing condition where the same grammar rule will
be reduced regardless of what valid token comes next on the input.
For such states, SLY chooses to go ahead and reduce the grammar rule
*without reading the next input token*.  If the next token is bad, SLY
will eventually get around to reading it and report a syntax error.
It's just a little unusual in that you might see some of your grammar
rules firing immediately prior to the syntax error.

General comments on error handling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For normal types of languages, error recovery with error rules and
resynchronization characters is probably the most reliable
technique.  This is because you can instrument the grammar to catch
errors at selected places where it is relatively easy to recover and
continue parsing.  Panic mode recovery is really only useful in
certain specialized applications where you might want to discard huge
portions of the input text to find a valid restart point.

Line Number and Position Tracking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Position tracking is often a tricky problem when writing compilers.
By default, SLY tracks the line number and position of all tokens.
The following attributes may be useful in a production method:

- ``p.lineno``. Line number of the left-most terminal in a production.

- ``p.index``. Lexing index of the left-most terminal in a production.

For example::

    @_('expr PLUS expr')
    def expr(self, p):
        line = p.lineno      # line number of the PLUS token
        index = p.index      # Index of the PLUS token in input text

SLY doesn't propagate line number information to non-terminals. If you
need this, you'll need to store line number information yourself and
propagate it in AST nodes or some other data structure.
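
A minimal sketch of doing that yourself (the ``Node`` class here is an
assumption for illustration, not part of SLY)::

    class Node:
        def __init__(self, kind, children, lineno=None):
            self.kind = kind
            self.children = children
            self.lineno = lineno        # carried along explicitly

    @_('expr PLUS expr')
    def expr(self, p):
        # p.lineno is the line of the left-most terminal (PLUS); store
        # it on the node so later passes can report positions.
        return Node('binary-expression', [p[0], p[2]], lineno=p.lineno)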

AST Construction
^^^^^^^^^^^^^^^^

SLY provides no special functions for constructing an abstract syntax
tree. However, such construction is easy enough to do on your own.

A minimal way to construct a tree is to simply create and propagate a
tuple or list in each grammar rule function. There are many possible
ways to do this, but one example would be something like this::

    @_('expr PLUS expr',
       'expr MINUS expr',
       'expr TIMES expr',
       'expr DIVIDE expr')
    def expr(self, p):
        return ('binary-expression', p[1], p[0], p[2])

    @_('LPAREN expr RPAREN')
    def expr(self, p):
        return ('group-expression', p[1])

    @_('NUMBER')
    def expr(self, p):
        return ('number-expression', p[0])
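
With these rules, an input such as ``2 + 3 * 4`` would come back as
nested tuples roughly like this (assuming the lexer produces integer
``NUMBER`` values and precedence rules make ``TIMES`` bind more
tightly)::

    ('binary-expression', '+',
        ('number-expression', 2),
        ('binary-expression', '*',
            ('number-expression', 3),
            ('number-expression', 4)))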

Another approach is to create a set of data structures for the
different kinds of abstract syntax tree nodes and create different
node types in each rule::

    class Expr:
        pass

    class BinOp(Expr):
        def __init__(self, op, left, right):
            self.op = op
            self.left = left
            self.right = right

    class Number(Expr):
        def __init__(self, value):
            self.value = value

    @_('expr PLUS expr',
       'expr MINUS expr',
       'expr TIMES expr',
       'expr DIVIDE expr')
    def expr(self, p):
        return BinOp(p[1], p[0], p[2])

    @_('LPAREN expr RPAREN')
    def expr(self, p):
        return p[1]

    @_('NUMBER')
    def expr(self, p):
        return Number(p[0])

The advantage of this approach is that it may make it easier to attach
more complicated semantics, type checking, code generation, and other
features to the node classes.
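
For instance, here is a rough sketch (not part of SLY) of attaching an
evaluation pass to the node classes defined above::

    def evaluate(node):
        # Walk the tree built by the parser and compute a value
        if isinstance(node, Number):
            return node.value
        elif isinstance(node, BinOp):
            left = evaluate(node.left)
            right = evaluate(node.right)
            if node.op == '+':
                return left + right
            elif node.op == '-':
                return left - right
            elif node.op == '*':
                return left * right
            elif node.op == '/':
                return left / right
        raise TypeError('Unexpected node: {!r}'.format(node))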

Embedded Actions
^^^^^^^^^^^^^^^^

The parsing technique used by SLY only allows actions to be executed
at the end of a rule. For example, suppose you have a rule like this::

    @_('A B C D')
    def foo(self, p):
        print("Parsed a foo", p[0], p[1], p[2], p[3])

In this case, the supplied action code only executes after all of the
symbols ``A``, ``B``, ``C``, and ``D`` have been parsed. Sometimes,
however, it is useful to execute small code fragments during
intermediate stages of parsing. For example, suppose you wanted to
perform some action immediately after ``A`` has been parsed. To do
this, write an empty rule like this::

    @_('A seen_A B C D')
    def foo(self, p):
        print("Parsed a foo", p[0], p[2], p[3], p[4])
        print("seen_A returned", p[1])

    @_('')
    def seen_A(self, p):
        print("Saw an A = ", p[-1])   # Access grammar symbol to the left
        return 'some_value'           # Assign value to seen_A

In this example, the empty ``seen_A`` rule executes immediately after
``A`` is shifted onto the parsing stack. Within this rule, ``p[-1]``
refers to the symbol on the stack that appears immediately to the left
of the ``seen_A`` symbol. In this case, it would be the value of
``A`` in the ``foo`` rule immediately above. As with other rules, an
embedded action supplies a value simply by returning it.

The use of embedded actions can sometimes introduce extra shift/reduce
conflicts. For example, this grammar has no conflicts::

    @_('abcd',
       'abcx')
    def foo(self, p):
        pass

    @_('A B C D')
    def abcd(self, p):
        pass

    @_('A B C X')
    def abcx(self, p):
        pass

However, if you insert an embedded action into one of the rules like
this::

    @_('abcd',
       'abcx')
    def foo(self, p):
        pass

    @_('A B C D')
    def abcd(self, p):
        pass

    @_('A B seen_AB C X')
    def abcx(self, p):
        pass

    @_('')
    def seen_AB(self, p):
        pass

an extra shift/reduce conflict will be introduced. This conflict is
caused by the fact that the same symbol ``C`` appears next in
both the ``abcd`` and ``abcx`` rules. The parser can either
shift the symbol (``abcd`` rule) or reduce the empty
rule ``seen_AB`` (``abcx`` rule).

A common use of embedded actions is to control other aspects of
parsing such as scoping of local variables. For example, if you were
parsing C code, you might write code like this::

    @_('LBRACE new_scope statements RBRACE')
    def statements(self, p):
        # Action code
        ...
        pop_scope()        # Return to previous scope

    @_('')
    def new_scope(self, p):
        # Create a new scope for local variables
        create_scope()
        ...

In this case, the embedded action ``new_scope`` executes immediately
after a ``LBRACE`` (``{``) symbol is parsed.  This might adjust
internal symbol tables and other aspects of the parser.  Upon
completion of the rule ``statements``, the action code undoes the
operations performed in the embedded action (e.g., by calling
``pop_scope()``).
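
A rough sketch of the helper functions used above (``create_scope``
and ``pop_scope`` are assumptions for illustration, not part of SLY)
might model scopes as a stack of dictionaries::

    scopes = [{}]                   # Start with a global scope

    def create_scope():
        scopes.append({})           # Push a new local scope

    def pop_scope():
        scopes.pop()                # Return to the enclosing scope

    def lookup(name):
        # Extra helper: search the innermost scope first
        for scope in reversed(scopes):
            if name in scope:
                return scope[name]
        raise KeyError(name)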