doc update

David Beazley 2016-09-07 17:59:09 -05:00
parent 9d96455bdf
commit fe97ffc0fd
2 changed files with 141 additions and 107 deletions

View File

@@ -50,22 +50,22 @@ following input string::

    x = 3 + 42 * (s - t)

The first step of any parsing is to break the text into tokens where
each token has a type and value. For example, the above text might be
described by the following list of token tuples::

    [ ('ID','x'), ('EQUALS','='), ('NUMBER','3'),
      ('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
      ('LPAREN','('), ('ID','s'), ('MINUS','-'),
      ('ID','t'), ('RPAREN',')') ]

The SLY ``Lexer`` class is used to do this. Here is a sample of a
simple lexer::

    # ------------------------------------------------------------
    # calclex.py
    #
    # Lexer for a simple expression evaluator for
    # numbers and +,-,*,/
    # ------------------------------------------------------------

@@ -83,7 +83,7 @@ tokenizer::

            'RPAREN',
        )

        # String containing ignored characters (spaces and tabs)
        ignore = ' \t'

        # Regular expression rules for simple tokens

@@ -107,7 +107,8 @@ tokenizer::

        # Error handling rule (skips ahead one character)
        def error(self, value):
            print("Line %d: Illegal character '%s'" %
                  (self.lineno, value[0]))
            self.index += 1

    if __name__ == '__main__':

@@ -134,22 +135,18 @@ When executed, the example will produce the following output::

    Line 3: Illegal character '^'
    Token(NUMBER, 2, 3, 50)

A lexer only has one public method, ``tokenize()``. This is a
generator function that produces a stream of ``Token`` instances. The
``type`` and ``value`` attributes of ``Token`` contain the token type
name and value respectively. The ``lineno`` and ``index`` attributes
contain the line number and position in the input text where the
token appears.

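As a quick sketch of typical usage (assuming the ``CalcLexer`` class
shown earlier), the generator can be driven with a simple loop::

    data = 'x = 3 + 42 * (s - t)'
    lexer = CalcLexer()
    for tok in lexer.tokenize(data):
        print(tok.type, tok.value, tok.lineno, tok.index)
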
The tokens list
^^^^^^^^^^^^^^^

Lexers must specify a ``tokens`` attribute that defines all of the
possible token type names that can be produced by the lexer. This
list is always required and is used to perform a variety of
validation checks.

In the example, the following code specifies the token names::

@@ -169,52 +166,67 @@ In the example, the following code specified the token names::

        ...

Specification of tokens
^^^^^^^^^^^^^^^^^^^^^^^

Tokens are specified by writing a regular expression rule compatible
with Python's ``re`` module. This is done by writing definitions that
match one of the names of the tokens provided in the ``tokens``
attribute. For example::

    PLUS = r'\+'
    MINUS = r'-'

Sometimes you want to perform an action when a token is matched. For
example, maybe you want to convert a numeric value or look up a
symbol. To do this, write your action as a method and give the
associated regular expression using the ``@_()`` decorator like this::

    @_(r'\d+')
    def NUMBER(self, t):
        t.value = int(t.value)
        return t

The method always takes a single argument which is an instance of
``Token``. By default, ``t.type`` is set to the name of the token
(e.g., ``'NUMBER'``). The method can change the token type and value
as it sees appropriate. When finished, the resulting token object
should be returned. If no value is returned, the token is simply
discarded and the next token is read.

Internally, the ``Lexer`` class uses the ``re`` module to do its
pattern matching. Patterns are compiled using the ``re.VERBOSE`` flag
which can be used to help readability. However, be aware that
unescaped whitespace is ignored and comments are allowed in this mode.
If your pattern involves whitespace, make sure you use ``\s``. If you
need to match the ``#`` character, use ``[#]``.

Controlling Match Order
^^^^^^^^^^^^^^^^^^^^^^^

Tokens are matched in the same order as patterns are listed in the
``Lexer`` class. Be aware that longer tokens may need to be specified
before short tokens. For example, if you wanted to have separate
tokens for "=" and "==", you need to make sure that "==" is listed
first::

    class MyLexer(Lexer):
        tokens = ('ASSIGN', 'EQUALTO', ...)
        ...
        EQUALTO = r'=='      # MUST APPEAR FIRST!
        ASSIGN  = r'='

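To see the effect of ordering, consider a small self-contained sketch
(the ``ID`` rule and lexer name here are illustrative additions, not
part of the example above)::

    from sly import Lexer

    class OrderLexer(Lexer):
        tokens = ('ID', 'EQUALTO', 'ASSIGN')
        ignore = ' \t'
        EQUALTO = r'=='      # Listed first, so '==' matches as one token
        ASSIGN  = r'='
        ID      = r'[a-zA-Z_][a-zA-Z_0-9]*'

    # [t.type for t in OrderLexer().tokenize('a == b')]
    # -> ['ID', 'EQUALTO', 'ID']
    # With ASSIGN listed first, '==' would come out as two ASSIGN tokens.
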
To handle reserved words, you should write a single rule to match an
identifier and do a special name lookup in a function like this::

    class MyLexer(Lexer):
        reserved = { 'if', 'then', 'else', 'while' }
        tokens = ['LPAREN','RPAREN',...,'ID'] + [ w.upper() for w in reserved ]

        @_(r'[a-zA-Z_][a-zA-Z_0-9]*')
        def ID(self, t):
            # Check to see if the name is a reserved word.
            # If so, change its type.
            if t.value in self.reserved:
                t.type = t.value.upper()
            return t

@@ -226,25 +238,25 @@ For example, suppose you wrote rules like this::

    PRINT = r'print'

In this case, the rules will be triggered for identifiers that include
those words as a prefix such as "forget" or "printed". This is
probably not what you want.

Discarded text
^^^^^^^^^^^^^^

To discard text, such as a comment, simply define a token rule that
returns no value. For example::

    @_(r'\#.*')
    def COMMENT(self, t):
        pass
        # No return value. Token discarded

Alternatively, you can include the prefix ``ignore_`` in a token
declaration to force a token to be ignored. For example::

    ignore_COMMENT = r'\#.*'

Line numbers and position tracking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, lexers know nothing about line numbers. This is because
they don't know anything about what constitutes a "line" of input.

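To make ``lineno`` track lines, a rule has to update it explicitly
whenever a newline is matched. A minimal sketch of such a rule (its
exact body lies outside the hunks shown here), consistent with the
``newline()`` referenced under "Ignored characters" below::

    @_(r'\n+')
    def newline(self, t):
        self.lineno += t.value.count('\n')
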
@@ -265,7 +277,7 @@ Lexers do not perform any kind of automatic column tracking. However,

they do record positional information related to each token in the
``index`` attribute. Using this, it is usually possible to compute
column information as a separate step. For instance, you could count
backwards until you reach the previous newline::

    # Compute column.
    #     input is the input text string
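    #
    # The helper body below falls outside the hunks shown here; it is
    # a plausible sketch consistent with the comments above, not the
    # file's actual code.
    def find_column(input, token):
        last_cr = input.rfind('\n', 0, token.index)
        # Columns are numbered starting at 1
        return token.index - last_cr
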
@@ -279,19 +291,19 @@ backwards until you reach a newline::

Since column information is often only useful in the context of error
handling, calculating the column position can be performed when needed
as opposed to including it on each token.

Ignored characters
^^^^^^^^^^^^^^^^^^

The special ``ignore`` specification is reserved for characters that
should be completely ignored in the input stream. Usually this is
used to skip over whitespace and other non-essential characters.
Although it is possible to define a regular expression rule for
whitespace in a manner similar to ``newline()``, the use of ``ignore``
provides substantially better lexing performance because it is handled
as a special case and is checked in a much more efficient manner than
the normal regular expression rules.

The characters given in ``ignore`` are not ignored when such
characters are part of other regular expression patterns. For

@@ -301,10 +313,10 @@ way). The main purpose of ``ignore`` is to ignore whitespace and

other padding between the tokens that you actually want to parse.

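For example, with ``ignore = ' \t'``, spaces inside a quoted string
are still part of the matched token. A small sketch (the ``STRING``
rule is a hypothetical addition)::

    ignore = ' \t'
    STRING = r'"[^"]*"'    # Spaces inside the quotes are kept in t.value
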
Literal characters
^^^^^^^^^^^^^^^^^^

Literal characters can be specified by defining a variable
``literals`` in the class. For example::

    class MyLexer(Lexer):
        ...

@@ -319,7 +331,7 @@ of the literal characters, it will always take precedence.

When a literal token is returned, both its ``type`` and ``value``
attributes are set to the character itself. For example, ``'+'``.

It's possible to write token methods that perform additional actions
when literals are matched. However, you'll need to set the token type
appropriately. For example::

@@ -327,55 +339,74 @@ appropriately. For example::

    literals = [ '{', '}' ]

    def __init__(self):
        self.indentation_level = 0

    @_(r'\{')
    def lbrace(self, t):
        t.type = '{'        # Set token type to the expected literal
        self.indentation_level += 1
        return t

    @_(r'\}')
    def rbrace(self, t):
        t.type = '}'        # Set token type to the expected literal
        self.indentation_level -= 1
        return t

Error handling
^^^^^^^^^^^^^^

The ``error()`` method is used to handle lexing errors that occur when
illegal characters are detected. The error method receives a string
containing all remaining untokenized text. A typical handler might
look at this text and skip ahead in some manner. For example::

    class MyLexer(Lexer):
        ...
        # Error handling rule
        def error(self, value):
            print("Illegal character '%s'" % value[0])
            self.index += 1

In this case, we simply print the offending character and skip ahead
one character by updating the lexer position. Error handling in a
parser is often a hard problem. An error handler might scan ahead
to a logical synchronization point such as a semicolon, a blank line,
or similar landmark.

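For instance, a more aggressive handler might discard everything up to
the next semicolon. A sketch of that idea (the recovery policy here is
an assumption, not something SLY prescribes)::

    class MyLexer(Lexer):
        ...
        def error(self, value):
            print("Illegal character '%s'" % value[0])
            pos = value.find(';')          # Assumed synchronization point
            if pos < 0:
                self.index += len(value)   # No landmark; discard the rest
            else:
                self.index += pos + 1      # Resume just past the ';'
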
EOF Handling
^^^^^^^^^^^^

The lexer will produce tokens until it reaches the end of the supplied
input string. An optional ``eof()`` method can be used to handle an
end-of-file (EOF) condition in the input. For example::

    class MyLexer(Lexer):
        ...
        # EOF handling rule
        def eof(self):
            # Get more input (Example)
            more = input('more > ')
            return more

The ``eof()`` method should return a string as a result. Be aware
that reading input in chunks may require great attention to the
handling of chunk boundaries. Specifically, you can't break the text
such that a chunk boundary appears in the middle of a token (for
example, splitting input in the middle of a quoted string). For this
reason, you might have to do some additional framing of the data such
as splitting into lines or blocks to make it work.

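One way to respect token boundaries is to feed the lexer whole lines.
A sketch, assuming an ``__init__()`` elsewhere stored an open file in
``self.file``::

    class MyLexer(Lexer):
        ...
        def eof(self):
            # Hand back one complete line at a time so no token can
            # straddle a chunk boundary (assumes tokens never span
            # lines).  readline() returns '' at real EOF, which stops
            # tokenizing.
            return self.file.readline()
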
Maintaining extra state
^^^^^^^^^^^^^^^^^^^^^^^
In your lexer, you may want to maintain a variety of other state
information. This might include mode settings, symbol tables, and
other details. As an example, suppose that you wanted to keep track
of how many NUMBER tokens had been encountered. You can do this by
adding an ``__init__()`` method and adding more attributes. For
example::

    class MyLexer(Lexer):
        def __init__(self):

@@ -387,3 +418,6 @@ class MyLexer(Lexer):

            t.value = int(t.value)
            return t

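The middle of this example falls outside the hunk shown above; a
plausible completion that counts ``NUMBER`` tokens as described (the
``count`` attribute name is a guess)::

    class MyLexer(Lexer):
        def __init__(self):
            self.count = 0          # Number of NUMBER tokens seen

        @_(r'\d+')
        def NUMBER(self, t):
            self.count += 1
            t.value = int(t.value)
            return t
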
Please note that lexers already use the ``lineno`` and ``index``
attributes during parsing.

View File

@@ -222,7 +222,7 @@ class Lexer(metaclass=LexerMeta):

            except IndexError:
                if self.eof:
                    text = self.eof()
                    if text:
                        index = 0
                        continue
                break