Merge pull request #8 from xpvpc/master

cosmetic changes to docs
David Beazley 2018-05-19 06:29:13 -05:00 committed by GitHub
commit 995d0ecff1

@@ -2,9 +2,9 @@ SLY (Sly Lex Yacc)
==================

This document provides an overview of lexing and parsing with SLY.
Given the intrinsic complexity of parsing, I would strongly advise
that you read (or at least skim) this entire document before jumping
into a big development project with SLY.

SLY requires Python 3.6 or newer. If you're using an older version,
you're out of luck. Sorry.
@@ -54,10 +54,10 @@ The first step of parsing is to break the text into tokens where
each token has a type and value. For example, the above text might be
described by the following list of token tuples::

    [ ('ID','x'), ('EQUALS','='), ('NUMBER','3'),
      ('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
      ('LPAREN','('), ('ID','s'), ('MINUS','-'),
      ('ID','t'), ('RPAREN',')') ]

The SLY ``Lexer`` class is used to do this. Here is a sample of a simple
lexer that tokenizes the above text::
@@ -68,7 +68,7 @@ lexer that tokenizes the above text::

    class CalcLexer(Lexer):
        # Set of token names. This is always required
        tokens = { ID, NUMBER, PLUS, MINUS, TIMES,
                   DIVIDE, ASSIGN, LPAREN, RPAREN }

        # String containing ignored characters between tokens
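
The hunks here only show fragments of that lexer. As a rough sketch
(the token patterns below are assumed from the token names, not taken
from the diff), the complete class might look like this::

    from sly import Lexer

    class CalcLexer(Lexer):
        # Set of token names. This is always required
        tokens = { ID, NUMBER, PLUS, MINUS, TIMES,
                   DIVIDE, ASSIGN, LPAREN, RPAREN }

        # String containing ignored characters between tokens
        ignore = ' \t'

        # Regular expression rules for tokens (assumed patterns)
        ID      = r'[a-zA-Z_][a-zA-Z0-9_]*'
        NUMBER  = r'\d+'
        PLUS    = r'\+'
        MINUS   = r'-'
        TIMES   = r'\*'
        DIVIDE  = r'/'
        ASSIGN  = r'='
        LPAREN  = r'\('
        RPAREN  = r'\)'
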
@@ -108,7 +108,7 @@ When executed, the example will produce the following output::

A lexer only has one public method ``tokenize()``. This is a generator
function that produces a stream of ``Token`` instances.
The ``type`` and ``value`` attributes of ``Token`` contain the
token type name and value respectively.
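
Since ``tokenize()`` is a generator, you drive it with a loop; a quick
usage sketch (assuming the ``CalcLexer`` class above)::

    lexer = CalcLexer()
    for tok in lexer.tokenize('x = 3 + 42'):
        print(f'type={tok.type!r}, value={tok.value!r}')
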
The tokens set
^^^^^^^^^^^^^^^
@@ -122,11 +122,11 @@ In the example, the following code specified the token names::

    class CalcLexer(Lexer):
        ...
        # Set of token names. This is always required
        tokens = { ID, NUMBER, PLUS, MINUS, TIMES,
                   DIVIDE, ASSIGN, LPAREN, RPAREN }
        ...

Token names should be specified using all-caps as shown.
Specification of token match patterns
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -139,7 +139,7 @@ names of the tokens provided in the ``tokens`` set. For example::

    MINUS = r'-'

Regular expression patterns are compiled using the ``re.VERBOSE`` flag,
which can be used to help readability. However, unescaped whitespace
is ignored and comments are allowed in this mode. If your pattern
involves whitespace, make sure you use ``\s``. If you need to match
the ``#`` character, use ``[#]`` or ``\#``.
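
A small sketch of both escapes (the pattern names here are only
illustrative)::

    PHRASE = r'hello\sworld'   # \s matches the space; a bare ' ' is ignored
    HASH   = r'[#]'            # matches a literal '#'; a bare '#' starts a comment
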
@@ -189,8 +189,8 @@ comments and newlines::

    ...
    if __name__ == '__main__':
        data = '''x = 3 + 42
                  * (s    # This is a comment
                  - t)'''

        lexer = CalcLexer()
        for tok in lexer.tokenize(data):
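
The hunk doesn't show the rules that discard the comment and the line
breaks; in SLY these are typically written as ``ignore_*`` rules,
roughly like this::

    # Inside the lexer class
    ignore_comment = r'\#.*'           # discard '#' comments

    @_(r'\n+')
    def ignore_newline(self, t):
        self.lineno += len(t.value)    # keep line numbers accurate
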
@@ -219,7 +219,7 @@ object should be returned as a result. If no value is returned by the
function, the token is discarded and the next token read.

The ``@_()`` decorator is defined automatically within the ``Lexer``
class--you don't need to do any kind of special import for it.
It can also accept multiple regular expression rules. For example::

    @_(r'0x[0-9a-fA-F]+',
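
The rest of that rule is clipped by the hunk; a plausible completion,
converting both hexadecimal and decimal literals::

    @_(r'0x[0-9a-fA-F]+',
       r'\d+')
    def NUMBER(self, t):
        if t.value.startswith('0x'):
            t.value = int(t.value[2:], 16)   # hexadecimal literal
        else:
            t.value = int(t.value)           # decimal literal
        return t
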
@@ -249,8 +249,8 @@ behavior.

Token Remapping
^^^^^^^^^^^^^^^

Occasionally, you might need to remap tokens based on special cases.
Consider the case of matching identifiers such as "abc", "python", or "guido".
Certain identifiers such as "if", "else", and "while" might need to be
treated as special keywords. To handle this, include token remapping rules when
writing the lexer like this::
@@ -272,7 +272,7 @@ writing the lexer like this::

        ID['else'] = ELSE
        ID['while'] = WHILE

When parsing an identifier, the special cases will remap certain matching
values to a new token type. For example, if the value of an identifier is
"if" above, an ``IF`` token will be generated.
@@ -300,7 +300,7 @@ it does record positional information related to each token in the token's
column information as a separate step. For instance, you can search
backwards until you reach the previous newline::

    # Compute column.
    #     input is the input text string
    #     token is a token instance
    def find_column(text, token):
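
The function body is cut off by the hunk; a sketch of one way to finish
it, using the token's ``index`` attribute::

    def find_column(text, token):
        last_cr = text.rfind('\n', 0, token.index)   # previous newline (or -1)
        if last_cr < 0:
            last_cr = 0
        column = (token.index - last_cr) + 1
        return column
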
@@ -389,13 +389,13 @@ some other kind of error handling.

A More Complete Example
^^^^^^^^^^^^^^^^^^^^^^^

Here is a more complete example that puts many of these concepts
into practice::

    # calclex.py

    from sly import Lexer

    class CalcLexer(Lexer):
        # Set of token names. This is always required
        tokens = { NUMBER, ID, WHILE, IF, ELSE, PRINT,
@@ -420,7 +420,7 @@ into practice::

        GE      = r'>='
        GT      = r'>'
        NE      = r'!='

        @_(r'\d+')
        def NUMBER(self, t):
            t.value = int(t.value)
@@ -505,7 +505,7 @@ specification like this::

    expr : expr + term
         | expr - term
         | term

    term : term * factor
         | term / factor
         | factor
@@ -532,7 +532,7 @@ example, given the expression grammar above, you might write the
specification for the operation of a simple calculator like this::

    Grammar                   Action
    ------------------------  --------------------------------
    expr0 : expr1 + term      expr0.val = expr1.val + term.val
          | expr1 - term      expr0.val = expr1.val - term.val
          | term              expr0.val = term.val
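
In SLY, each grammar/action pair in that table becomes a decorated
method on a ``Parser`` subclass. A rough sketch for the ``expr`` rules
(assuming the ``CalcLexer`` tokens from earlier)::

    from sly import Parser

    class CalcParser(Parser):
        tokens = CalcLexer.tokens

        @_('expr PLUS term')
        def expr(self, p):
            return p.expr + p.term    # expr0.val = expr1.val + term.val

        @_('expr MINUS term')
        def expr(self, p):
            return p.expr - p.term    # expr0.val = expr1.val - term.val

        @_('term')
        def expr(self, p):
            return p.term             # expr0.val = term.val
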
@@ -549,7 +549,7 @@ values then propagate according to the actions described above. For
example, ``factor.val = int(NUMBER.val)`` propagates the value from
``NUMBER`` to ``factor``. ``term0.val = factor.val`` propagates the
value from ``factor`` to ``term``. Rules such as ``expr0.val =
expr1.val + term1.val`` combine and propagate values further. Just to
illustrate, here is how values propagate in the expression ``2 + 3 * 4``::

    NUMBER.val=2 + NUMBER.val=3 * NUMBER.val=4    # NUMBER -> factor
@@ -560,7 +560,7 @@ illustrate, here is how values propagate in the expression ``2 + 3 * 4``::

    expr.val=2 + term.val=3 * NUMBER.val=4    # NUMBER -> factor
    expr.val=2 + term.val=3 * factor.val=4    # term * factor -> term
    expr.val=2 + term.val=12                  # expr + term -> expr
    expr.val=14

SLY uses a parsing technique known as LR-parsing or shift-reduce
parsing. LR parsing is a bottom-up technique that tries to recognize
@@ -1050,7 +1050,7 @@ generate the same set of symbols. For example::

    assignment : ID EQUALS NUMBER
               | ID EQUALS expr

    expr : expr PLUS expr
         | expr MINUS expr
         | expr TIMES expr
@@ -1101,7 +1101,7 @@ states to the file you specify. Each state of the parser is shown
as output that looks something like this::

    state 2

        (7) factor -> LPAREN . expr RPAREN
        (1) expr -> . term
        (2) expr -> . expr MINUS term
@@ -1113,7 +1113,7 @@ as output that looks something like this::

        (8) factor -> . NUMBER

        LPAREN  shift and go to state 2
        NUMBER  shift and go to state 3

        factor  shift and go to state 1
        term    shift and go to state 4
        expr    shift and go to state 6
@@ -1127,7 +1127,7 @@ usually track down the source of most parsing conflicts. It should
also be stressed that not all shift-reduce conflicts are bad.
However, the only way to be sure that they are resolved correctly is
to look at the debugging file.

Syntax Error Handling
^^^^^^^^^^^^^^^^^^^^^
@@ -1212,7 +1212,7 @@ appear as the last token on the right in an error rule. For example::

This is because the first bad token encountered will cause the rule to
be reduced--which may make it difficult to recover if more bad tokens
immediately follow. It's better to have some kind of landmark such as
a semicolon, closing parenthesis, or other token that can be used as
a synchronization point.
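
A sketch of an error rule anchored on such a landmark (the rule shape
and the ``ASSIGN``/``SEMI`` token names here are assumed, not taken
from the diff)::

    @_('ID ASSIGN error SEMI')
    def statement(self, p):
        # The semicolon acts as the synchronization point
        print('Syntax error in assignment. Bad expression')
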
Panic mode recovery
@@ -1236,7 +1236,7 @@ state::

            # Read ahead looking for a closing '}'
            while True:
                tok = next(self.tokens, None)
                if not tok or tok.type == 'RBRACE':
                    break
            self.restart()
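
Pieced together with the method header the hunk omits, the recovery
handler looks roughly like this::

    def error(self, tok):
        if tok:
            # Read ahead looking for a closing '}'
            while True:
                tok = next(self.tokens, None)
                if not tok or tok.type == 'RBRACE':
                    break
            self.restart()
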
@@ -1271,12 +1271,12 @@ useful if trying to synchronize on special characters. For example::

            # Read ahead looking for a terminating ";"
            while True:
                tok = next(self.tokens, None)    # Get the next token
                if not tok or tok.type == 'SEMI':
                    break
            self.errok()

            # Return SEMI to the parser as the next lookahead token
            return tok
When Do Syntax Errors Get Reported?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1339,7 +1339,7 @@ are many possible ways to do this, but one example is something
like this::

    @_('expr PLUS expr',
       'expr MINUS expr',
       'expr TIMES expr',
       'expr DIVIDE expr')
    def expr(self, p):
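
The method body is clipped by the hunk; one plausible completion builds
a tuple-based parse tree node::

    def expr(self, p):
        # ('binop', operator, left subtree, right subtree)
        return ('binop', p[1], p.expr0, p.expr1)
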
@@ -1357,7 +1357,7 @@ Another approach is to create a set of data structures for different
kinds of abstract syntax tree nodes and create different node types
in each rule::

    class Expr:
        pass

    class BinOp(Expr):
@@ -1371,7 +1371,7 @@ in each rule::

            self.value = value

    @_('expr PLUS expr',
       'expr MINUS expr',
       'expr TIMES expr',
       'expr DIVIDE expr')
    def expr(self, p):
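
Again the body is clipped; a completion consistent with a
``BinOp(op, left, right)`` constructor (the exact signature isn't shown
in the diff)::

    def expr(self, p):
        # Build a BinOp node from the operator token and two subtrees
        return BinOp(p[1], p.expr0, p.expr1)
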
@@ -1494,7 +1494,7 @@ C code, you might write code like this::

        # Action code
        ...
        pop_scope()        # Return to previous scope

    @_('')
    def new_scope(self, p):
        # Create a new scope for local variables
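
The body of ``new_scope`` is not shown; a sketch using a hypothetical
``push_scope()`` helper, paired with the ``pop_scope()`` call in the
action code above::

    @_('')
    def new_scope(self, p):
        # Create a new scope for local variables
        push_scope()   # hypothetical helper matching pop_scope()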