doc update

David Beazley 2016-09-07 17:59:09 -05:00
parent 9d96455bdf
commit fe97ffc0fd
2 changed files with 141 additions and 107 deletions

View File

@@ -50,22 +50,22 @@ following input string::

    x = 3 + 42 * (s - t)

The first step of any parsing is to break the text into tokens where
each token has a type and value. For example, the above text might be
described by the following list of token tuples::

    [ ('ID','x'), ('EQUALS','='), ('NUMBER','3'),
      ('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
      ('LPAREN','('), ('ID','s'), ('MINUS','-'),
      ('ID','t'), ('RPAREN',')') ]

The SLY ``Lexer`` class is used to do this. Here is a sample of a
simple lexer::

    # ------------------------------------------------------------
    # calclex.py
    #
    # Lexer for a simple expression evaluator for
    # numbers and +,-,*,/
    # ------------------------------------------------------------

@@ -83,7 +83,7 @@ tokenizer::

            'RPAREN',
        )

        # String containing ignored characters (spaces and tabs)
        ignore = ' \t'

        # Regular expression rules for simple tokens

@@ -107,7 +107,8 @@ tokenizer::

        # Error handling rule (skips ahead one character)
        def error(self, value):
            print("Line %d: Illegal character '%s'" %
                  (self.lineno, value[0]))
            self.index += 1

    if __name__ == '__main__':

@@ -134,22 +135,18 @@ When executed, the example will produce the following output::

    Line 3: Illegal character '^'
    Token(NUMBER, 2, 3, 50)

A lexer only has one public method, ``tokenize()``. This is a
generator function that produces a stream of ``Token`` instances. The
``type`` and ``value`` attributes of ``Token`` contain the token type
name and value respectively. The ``lineno`` and ``index`` attributes
contain the line number and position in the input text where the
token appears.

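As a quick sketch of typical usage (assuming the ``CalcLexer`` class
shown earlier), the generator can be driven with a simple loop::

    data = 'x = 3 + 42 * (s - t)'
    lexer = CalcLexer()
    for tok in lexer.tokenize(data):
        print(tok.type, tok.value, tok.lineno, tok.index)
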
The tokens list
^^^^^^^^^^^^^^^

Lexers must specify a ``tokens`` attribute that defines all of the
possible token type names that can be produced by the lexer. This
list is always required and is used to perform a variety of
validation checks.

In the example, the following code specifies the token names::

@@ -169,52 +166,67 @@ In the example, the following code specified the token names::

        ...

Specification of tokens
^^^^^^^^^^^^^^^^^^^^^^^

Tokens are specified by writing a regular expression rule compatible
with Python's ``re`` module. This is done by writing definitions that
match one of the names of the tokens provided in the ``tokens``
attribute. For example::

    PLUS = r'\+'
    MINUS = r'-'

Sometimes you want to perform an action when a token is matched. For
example, maybe you want to convert a numeric value or look up a
symbol. To do this, write your action as a method and give the
associated regular expression using the ``@_()`` decorator like this::

    @_(r'\d+')
    def NUMBER(self, t):
        t.value = int(t.value)
        return t

The method always takes a single argument which is an instance of
``Token``. By default, ``t.type`` is set to the name of the token
(e.g., ``'NUMBER'``). The method can change the token type and value
as it sees appropriate. When finished, the resulting token object
should be returned. If no value is returned, the token is simply
discarded and the next token is read.

Internally, the ``Lexer`` class uses the ``re`` module to do its
pattern matching. Patterns are compiled using the ``re.VERBOSE`` flag
which can be used to help readability. However, be aware that
unescaped whitespace is ignored and comments are allowed in this mode.
If your pattern involves whitespace, make sure you use ``\s``. If you
need to match the ``#`` character, use ``[#]``.

Controlling Match Order
^^^^^^^^^^^^^^^^^^^^^^^

Tokens are matched in the same order as patterns are listed in the
``Lexer`` class. Be aware that longer tokens may need to be specified
before short tokens. For example, if you wanted to have separate
tokens for "=" and "==", you need to make sure that "==" is listed
first::

    class MyLexer(Lexer):
        tokens = ('ASSIGN', 'EQUALTO', ...)
        ...
        EQUALTO = r'=='      # MUST APPEAR FIRST!
        ASSIGN  = r'='

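To see the effect of ordering, consider a small self-contained sketch
(the ``ID`` rule and lexer name here are illustrative additions, not
part of the example above)::

    from sly import Lexer

    class OrderLexer(Lexer):
        tokens = ('ID', 'EQUALTO', 'ASSIGN')
        ignore = ' \t'
        EQUALTO = r'=='      # Listed first, so '==' matches as one token
        ASSIGN  = r'='
        ID      = r'[a-zA-Z_][a-zA-Z_0-9]*'

    # [t.type for t in OrderLexer().tokenize('a == b')]
    # -> ['ID', 'EQUALTO', 'ID']
    # With ASSIGN listed first, '==' would come out as two ASSIGN tokens.
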
To handle reserved words, you should write a single rule to match an
identifier and do a special name lookup in a function like this::

    class MyLexer(Lexer):
        reserved = { 'if', 'then', 'else', 'while' }
        tokens = ['LPAREN','RPAREN',...,'ID'] + [ w.upper() for w in reserved ]

        @_(r'[a-zA-Z_][a-zA-Z_0-9]*')
        def ID(self, t):
            # Check to see if the name is a reserved word.
            # If so, change its type.
            if t.value in self.reserved:
                t.type = t.value.upper()
            return t

@@ -226,25 +238,25 @@ For example, suppose you wrote rules like this::

    PRINT = r'print'

In this case, the rules will be triggered for identifiers that include
those words as a prefix such as "forget" or "printed". This is
probably not what you want.

Discarded text
^^^^^^^^^^^^^^

To discard text, such as a comment, simply define a token rule that
returns no value. For example::

    @_(r'\#.*')
    def COMMENT(self, t):
        pass
        # No return value. Token discarded

Alternatively, you can include the prefix ``ignore_`` in a token
declaration to force a token to be ignored. For example::

    ignore_COMMENT = r'\#.*'

Line numbers and position tracking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, lexers know nothing about line numbers. This is because
they don't know anything about what constitutes a "line" of input.

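To make ``lineno`` track lines, a rule has to update it explicitly
whenever a newline is matched. A minimal sketch of such a rule (its
exact body lies outside the hunks shown here), consistent with the
``newline()`` referenced under "Ignored characters" below::

    @_(r'\n+')
    def newline(self, t):
        self.lineno += t.value.count('\n')
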
@@ -265,7 +277,7 @@ Lexers do not perform any kind of automatic column tracking. However,

they do record positional information related to each token in the
``index`` attribute. Using this, it is usually possible to compute
column information as a separate step. For instance, you could count
backwards until you reach the previous newline::

    # Compute column.
    #     input is the input text string
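    #
    # The helper body below falls outside the hunks shown here; it is
    # a plausible sketch consistent with the comments above, not the
    # file's actual code.
    def find_column(input, token):
        last_cr = input.rfind('\n', 0, token.index)
        # Columns are numbered starting at 1
        return token.index - last_cr
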
@@ -279,19 +291,19 @@ backwards until you reach a newline::

Since column information is often only useful in the context of error
handling, calculating the column position can be performed when needed
as opposed to including it on each token.

Ignored characters
^^^^^^^^^^^^^^^^^^

The special ``ignore`` specification is reserved for characters that
should be completely ignored in the input stream. Usually this is
used to skip over whitespace and other non-essential characters.
Although it is possible to define a regular expression rule for
whitespace in a manner similar to ``newline()``, the use of ``ignore``
provides substantially better lexing performance because it is handled
as a special case and is checked in a much more efficient manner than
the normal regular expression rules.

The characters given in ``ignore`` are not ignored when such
characters are part of other regular expression patterns. For

@@ -301,10 +313,10 @@ way). The main purpose of ``ignore`` is to ignore whitespace and

other padding between the tokens that you actually want to parse.

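For example, with ``ignore = ' \t'``, spaces inside a quoted string
are still part of the matched token. A small sketch (the ``STRING``
rule is a hypothetical addition)::

    ignore = ' \t'
    STRING = r'"[^"]*"'    # Spaces inside the quotes are kept in t.value
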
Literal characters
^^^^^^^^^^^^^^^^^^

Literal characters can be specified by defining a variable
``literals`` in the class. For example::

    class MyLexer(Lexer):
        ...

@@ -319,7 +331,7 @@ of the literal characters, it will always take precedence.

When a literal token is returned, both its ``type`` and ``value``
attributes are set to the character itself. For example, ``'+'``.

It's possible to write token methods that perform additional actions
when literals are matched. However, you'll need to set the token type
appropriately. For example::

@@ -327,55 +339,74 @@ appropriately. For example::

    literals = [ '{', '}' ]

    def __init__(self):
        self.indentation_level = 0

    @_(r'\{')
    def lbrace(self, t):
        t.type = '{'        # Set token type to the expected literal
        self.indentation_level += 1
        return t

    @_(r'\}')
    def rbrace(self, t):
        t.type = '}'        # Set token type to the expected literal
        self.indentation_level -= 1
        return t

Error handling
^^^^^^^^^^^^^^

The ``error()`` method is used to handle lexing errors that occur when
illegal characters are detected. The error method receives a string
containing all remaining untokenized text. A typical handler might
look at this text and skip ahead in some manner. For example::

    class MyLexer(Lexer):
        ...
        # Error handling rule
        def error(self, value):
            print("Illegal character '%s'" % value[0])
            self.index += 1

In this case, we simply print the offending character and skip ahead
one character by updating the lexer position. Error handling in a
parser is often a hard problem. An error handler might scan ahead
to a logical synchronization point such as a semicolon, a blank line,
or similar landmark.

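For instance, a more aggressive handler might discard everything up to
the next semicolon. A sketch of that idea (the recovery policy here is
an assumption, not something SLY prescribes)::

    class MyLexer(Lexer):
        ...
        def error(self, value):
            print("Illegal character '%s'" % value[0])
            pos = value.find(';')          # Assumed synchronization point
            if pos < 0:
                self.index += len(value)   # No landmark; discard the rest
            else:
                self.index += pos + 1      # Resume just past the ';'
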
EOF Handling
^^^^^^^^^^^^

The lexer will produce tokens until it reaches the end of the supplied
input string. An optional ``eof()`` method can be used to handle an
end-of-file (EOF) condition in the input. For example::

    class MyLexer(Lexer):
        ...
        # EOF handling rule
        def eof(self):
            # Get more input (Example)
            more = input('more > ')
            return more

The ``eof()`` method should return a string as a result. Be aware
that reading input in chunks may require great attention to the
handling of chunk boundaries. Specifically, you can't break the text
such that a chunk boundary appears in the middle of a token (for
example, splitting input in the middle of a quoted string). For this
reason, you might have to do some additional framing of the data such
as splitting into lines or blocks to make it work.

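One way to respect token boundaries is to feed the lexer whole lines.
A sketch, assuming an ``__init__()`` elsewhere stored an open file in
``self.file``::

    class MyLexer(Lexer):
        ...
        def eof(self):
            # Hand back one complete line at a time so no token can
            # straddle a chunk boundary (assumes tokens never span
            # lines).  readline() returns '' at real EOF, which stops
            # tokenizing.
            return self.file.readline()
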
Maintaining extra state
^^^^^^^^^^^^^^^^^^^^^^^
In your lexer, you may want to maintain a variety of other state
information. This might include mode settings, symbol tables, and
other details. As an example, suppose that you wanted to keep track
of how many NUMBER tokens had been encountered. You can do this by
adding an ``__init__()`` method and adding more attributes. For
example::

    class MyLexer(Lexer):
        def __init__(self):

@@ -387,3 +418,6 @@ class MyLexer(Lexer):

            t.value = int(t.value)
            return t

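The middle of this example falls outside the hunk shown above; a
plausible completion that counts ``NUMBER`` tokens as described (the
``count`` attribute name is a guess)::

    class MyLexer(Lexer):
        def __init__(self):
            self.count = 0          # Number of NUMBER tokens seen

        @_(r'\d+')
        def NUMBER(self, t):
            self.count += 1
            t.value = int(t.value)
            return t
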
Please note that lexers already use the ``lineno`` and ``index``
attributes during parsing.

View File

@@ -222,7 +222,7 @@ class Lexer(metaclass=LexerMeta):

            except IndexError:
                if self.eof:
                    text = self.eof()
                    if text:
                        index = 0
                        continue
                break