Version 0.3
-----------

4/1/2018 Support for Lexer inheritance added. For example:

    from sly import Lexer

    class BaseLexer(Lexer):
        tokens = { NAME, NUMBER }
        ignore = ' \t'

        NAME = r'[a-zA-Z]+'
        NUMBER = r'\d+'

    class ChildLexer(BaseLexer):
        tokens = { PLUS, MINUS }
        PLUS = r'\+'
        MINUS = r'-'

In this example, the ChildLexer class gets all of the tokens
from the parent class (BaseLexer) in addition to the new
definitions it adds of its own.

One quirk of Lexer inheritance is that definition order has
an impact on the low-level regular expression matching. By
default, new definitions are always processed AFTER any previous
definitions. You can change this using the before() function
like this:

    class GrandChildLexer(ChildLexer):
        tokens = { PLUSPLUS, MINUSMINUS }
        PLUSPLUS = before(PLUS, r'\+\+')
        MINUSMINUS = before(MINUS, r'--')

In this example, the PLUSPLUS token is checked before the
PLUS token in the base class. Thus, an input text of '++'
will be parsed as a single token PLUSPLUS, not two PLUS tokens.

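Why the ordering matters: the token patterns are combined into one
master regular expression, and the first alternative that matches wins.
A minimal sketch with the standard re module (not sly itself; the
master patterns and the tokenize() helper here are illustrative
stand-ins) shows the effect:

```python
import re

# Hypothetical master patterns; the first alternative that matches wins.
master_first = re.compile(r'(?P<PLUSPLUS>\+\+)|(?P<PLUS>\+)')
master_second = re.compile(r'(?P<PLUS>\+)|(?P<PLUSPLUS>\+\+)')

def tokenize(pattern, text):
    """Return the names of the token rules matched, left to right."""
    names = []
    pos = 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None:
            break                     # no rule matches; stop for this sketch
        names.append(m.lastgroup)     # name of the alternative that matched
        pos = m.end()
    return names

print(tokenize(master_first, '++'))   # ['PLUSPLUS']
print(tokenize(master_second, '++'))  # ['PLUS', 'PLUS']
```

With PLUSPLUS listed first, '++' is one token; listed after PLUS, the
same input comes out as two PLUS tokens.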
4/1/2018 Better support for lexing states. Each lexing state can be
defined as a separate class. Use the begin(cls) method to switch to a
different state. For example:

    from sly import Lexer

    class LexerA(Lexer):
        tokens = { NAME, NUMBER, LBRACE }

        ignore = ' \t'

        NAME = r'[a-zA-Z]+'
        NUMBER = r'\d+'
        LBRACE = r'\{'

        def LBRACE(self, t):
            self.begin(LexerB)
            return t

    class LexerB(Lexer):
        tokens = { PLUS, MINUS, RBRACE }

        ignore = ' \t'

        PLUS = r'\+'
        MINUS = r'-'
        RBRACE = r'\}'

        def RBRACE(self, t):
            self.begin(LexerA)
            return t

In this example, LexerA switches to a new state LexerB when
a left brace ({) is encountered. The begin() method causes
the state transition. LexerB switches back to state LexerA
when a right brace (}) is encountered.

As an alternative to the begin() method, you can also use the
push_state(cls) and pop_state() methods. These manage the lexing
states as a stack. The pop_state() method returns to the previous
lexing state.

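The stack behavior can be illustrated with a minimal plain-Python
sketch (this is not sly's implementation; the StatefulLexer class is a
hypothetical stand-in that mirrors the push_state/pop_state semantics
described above):

```python
class StatefulLexer:
    """Sketch of stack-managed lexing states (hypothetical stand-in)."""
    def __init__(self, start_state):
        self.state = start_state
        self.state_stack = []

    def push_state(self, new_state):
        # Save the current state, then switch to the new one.
        self.state_stack.append(self.state)
        self.state = new_state

    def pop_state(self):
        # Return to whatever state was active before the last push.
        self.state = self.state_stack.pop()

lexer = StatefulLexer('LexerA')
lexer.push_state('LexerB')   # enter a nested state
lexer.push_state('LexerC')   # nest one level deeper
lexer.pop_state()            # back to 'LexerB'
print(lexer.state)           # 'LexerB'
```

Because the states form a stack, nested constructs (such as braces
inside braces) unwind back through each enclosing state in order.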
1/27/2018 Tokens no longer have to be specified as strings. For example,
you can now write:

    from sly import Lexer

    class TheLexer(Lexer):
        tokens = { ID, NUMBER, PLUS, MINUS }

        ID = r'[a-zA-Z_][a-zA-Z0-9_]*'
        NUMBER = r'\d+'
        PLUS = r'\+'
        MINUS = r'-'

This convention also carries over to the parser for things such
as precedence specifiers:

    from sly import Parser

    class TheParser(Parser):
        tokens = TheLexer.tokens

        precedence = (
            ('left', PLUS, MINUS),
            ('left', TIMES, DIVIDE),
            ('right', UMINUS),
        )
        ...

Never mind the fact that ID, NUMBER, PLUS, and MINUS appear to be
undefined identifiers. It all works.

1/27/2018 Tokens now allow special-case remapping. For example:

    from sly import Lexer

    class TheLexer(Lexer):
        tokens = { ID, IF, ELSE, WHILE, NUMBER, PLUS, MINUS }

        ID = r'[a-zA-Z_][a-zA-Z0-9_]*'
        ID['if'] = IF
        ID['else'] = ELSE
        ID['while'] = WHILE

        NUMBER = r'\d+'
        PLUS = r'\+'
        MINUS = r'-'

In this code, the ID rule matches any identifier. However,
special cases have been made for IF, ELSE, and WHILE tokens.
Previously, this had to be handled in a special action method
such as this:

    def ID(self, t):
        if t.value in { 'if', 'else', 'while' }:
            t.type = t.value.upper()
        return t

Never mind the fact that the syntax appears to suggest that strings
work as a kind of mutable mapping.

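Behaviorally, the remapping amounts to a dictionary lookup on the
matched text. A plain-Python sketch (not sly's internals; the remap
table and classify() helper are hypothetical) of what the special
cases do:

```python
# Hypothetical remap table mirroring ID['if'] = IF, etc.
remap = {'if': 'IF', 'else': 'ELSE', 'while': 'WHILE'}

def classify(value):
    # An identifier whose text appears in the remap table gets the
    # special token type; everything else stays a plain ID.
    return remap.get(value, 'ID')

print(classify('if'))     # 'IF'
print(classify('count'))  # 'ID'
```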
1/16/2018 Usability improvement on Lexer class. Regular expression rules
specified as strings that don't match any name in tokens are
now reported as errors.

Version 0.2
-----------

12/24/2017 The error(self, t) method of lexer objects now receives a
token as input. The value attribute of this token contains
all remaining input text. If the passed token is returned
by error(), then it shows up in the token stream where it
can be processed by the parser.
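A plain-Python sketch of the idea (not sly itself; the Token class,
the toy tokenize() loop, and the ERROR token type are simplified
stand-ins for illustration):

```python
import re

class Token:
    def __init__(self, type, value):
        self.type = type
        self.value = value

def error(t):
    # Returning the token forwards it into the token stream, where the
    # parser can see it. Keep just the offending character as its value.
    t.value = t.value[0]
    return t

def tokenize(text):
    """Toy lexer: numbers only; anything else goes through error()."""
    pos = 0
    while pos < len(text):
        m = re.match(r'\d+', text[pos:])
        if m:
            yield Token('NUMBER', m.group())
            pos += m.end()
        else:
            # error() receives a token whose value is the remaining input.
            t = error(Token('ERROR', text[pos:]))
            pos += 1              # skip past the bad character
            if t is not None:
                yield t           # returned token shows up in the stream

print([t.type for t in tokenize('12$34')])  # ['NUMBER', 'ERROR', 'NUMBER']
```

If error() returns None instead, the bad input is simply discarded and
the parser never sees it.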