Info I wanna keep around just because

Some of this is actual historical information for the project. A lot of it is just stuff I wrote down & might want to use later.

v0.6 changes

A complete redesign of how directives are declared. The codebase is significantly cleaned up, and the project should be far more maintainable going forward, as well as much more useful as a lexer. I think v0.5 was never fully functional; I believe I abandoned it in favor of a new design for v0.6.

Some questions

These are probably not accurate. But I wanted to keep them around & maybe answer them again.

  • When is 'start' checked?
    • When the top 'started' list is empty and 'unstarted' is non-empty.
  • When are 'match' and 'stop' checked?
    • A directive's 'match' and 'stop' are checked if it begins the loop on the top started stack.
    • 'stop' is checked after 'match'.
    • Neither 'match' nor 'stop' is checked on the same loop that 'start' is checked.
    • 'stop' is checked whether 'match' passes or not.
    • 'stop' is propagated upward at the end of lexing (before or after onLexerEnd()?).
  • How do I customize tree output?
    • Possibly a custom Ast class. Should be specifiable in ast.new.

Changes Prior to v0.5

  • set_previous feature
  • ast_set feature
  • rewind() feature to move the pointer back
  • Wrote sample code & a draft document regarding a new design for the lexer, one that moves away from a state-based approach to an expectations-based one
  • Some changes and improvements to lexer & grammar
    • Grammar's flow for building regexes is improved
    • Lexer has minor changes to its implementation, especially in regard to automatically popping state & clearing buffers.
  • PhpGrammar2 handles properties, constants, methods, functions, strings, and some other things.
  • Implemented AST caching for files. File to parse checks sha1_file. filemtime is checked for each grammar. Does not check lexer, ast, token, or the base Grammar class files.
  • Write some docs. Clean up status notes
  • Updated existing grammars to work with the refactored lexer & Grammar
  • Refactored lexer & Grammar & added new features.
  • Wrote most of an introduction
  • Small DocBlockGrammar fix
  • Some bug fixes with lexing & state
  • (php grammar) Catch namespace, class, method, docblock, and property
  • Create Docblock grammar
  • Write bash grammar to capture docblocks & function names
  • docblock parsing (to get attributes, basically)
  • PhpGrammar successfully prototyped & catching docblocks, properties, and methods for PHP
  • Ast getTree()
  • Generalized Ast
  • Refactored tests
  • Make its own repo
  • Convert current run script to a tlftest

v0.3 architecture

This is out of date. We no longer use a state approach. Instead, each directive uses thens to point to the next directives to watch for.

The lexer manages

  • state: The name of what's being processed at the moment. State is kept on a stack.
    • state may be an asterisk ('*') or asterisk in array (['*']) to be valid for all states
    • Ex: After /** is found, we enter state=='docblock'.
      • When */ is reached, we pop the state & return to the parent state. That would be something like class_body if the docblock was found inside a class: class Something { /** docblock here */ }
    • When the state changes from one loop to the next, the list of valid regexes is updated
  • token: The text we're processing with convenience methods to get the current buffer, add a char to the buffer, etc
  • head: The ast at the top of the stack.
    • The initial ast is always on the bottom of the stack & is the first head that is used
    • Grammars append new asts to the stack & can pop them off.
      • Currently, this must be done programmatically in php. There is not a declarative solution.
  • valid_regex_list: The list of directives to check for the current state
  • success_regex: The directive whose pattern matched the current buffer
  • grammar: Directives & methods that allow you to do ast building & parsing with the lexer
  • directive: A regex to match & instructions about what to do when the regex is matched
    • Formerly, directives were simply called regexes
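
The state stack and the per-state directive list described above can be modeled in a few lines. This is a minimal, language-neutral sketch in Python, not the project's PHP code: the `StateStack` class, the `valid_directives` helper, the directive names, and the rule that the bottom state is never popped are all assumptions made for illustration.

```python
class StateStack:
    """Hypothetical stand-in for the lexer's state stack."""

    def __init__(self, initial):
        self._stack = [initial]

    def current(self):
        return self._stack[-1]

    def push(self, state):
        self._stack.append(state)

    def pop(self):
        # Assumption: the bottom state is never removed.
        if len(self._stack) > 1:
            self._stack.pop()
        # Return the parent state that is now on top.
        return self.current()


def valid_directives(directives, state):
    """Keep directives whose 'state' list names the current state or '*'."""
    return [d for d in directives if "*" in d["state"] or state in d["state"]]


# Illustrative directives only; real grammars declare regexes and more keys.
directives = [
    {"name": "docblock_open", "state": ["*"]},
    {"name": "docblock_close", "state": ["docblock"]},
    {"name": "method", "state": ["class_body"]},
]

states = StateStack("class_body")
in_class = [d["name"] for d in valid_directives(directives, states.current())]

states.push("docblock")  # after /** is found
in_docblock = [d["name"] for d in valid_directives(directives, states.current())]

parent = states.pop()  # when */ is reached, back to the parent state
```

Recomputing the list whenever the top of the stack changes is what "the list of valid regexes is updated" amounts to in this model.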

Setting up the lexer environment

  1. Lexer is initialized
  2. Grammars are added to the lexer. Additional work is done in their constructors (__construct)
    • Grammar must do additional processing for regex declarations. See Grammar.php for up-to-date implementation info
      • Convert all non-array values to arrays (except set_state)
      • check for onRegexName() method on the grammar and set it to onMatch
        • also checks on_regexName
      • Set state=>['*'] if it wasn't set
      • Set name=> to the key that identifies this regex entry in the grammar.
    • Grammar supports a regex_end feature which is not supported by the Lexer. Each regex_end entry is converted into a standalone regex with flags:
      • name=>'endofreg:the_original_regexes_name'
      • regex=> the regex array found at regex_end
      • state=> Whatever set_state is on the original regex
      • pop_state => true
      • onMatch => the matching "onend" method, if it exists on the current grammar. For a regex named "string_open", that is onendString_open()
    • The regex_end feature is now better used as 'end'=>[/*normal regex declaration*/], where everything in end is copied into its own regex entry
  3. An Ast is created to be the root
    • lex($filePath) creates an ast & sets attributes to that ast
    • For lex($filePath), the cache is checked & a new lex is only done if the file or any of the grammars are different from last time it was run (or if $lexer->useCache is false).
  4. The ast is set as the head
  5. A Token is created from the passed in string (file contents in the case of lex($filePath))
  6. For each grammar onLexerStart($lexer, $ast, $token) is called
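
The grammar's processing of regex declarations in step 2 (wrapping values in arrays, defaulting the state, naming, handler lookup, and the regex_end expansion) could look roughly like the following. This is a hedged Python model of the notes above, not a port of Grammar.php: the `normalize` function and `DemoGrammar` class are hypothetical, the first-letter capitalization of handler names is inferred from the onendString_open() example, and falling back to '*' when set_state is absent is an assumption.

```python
def normalize(declarations, grammar):
    """Turn {name: declaration} into a flat list of normalized directives."""
    normalized = []
    for name, decl in declarations.items():
        decl = dict(decl)  # do not mutate the caller's declaration
        regex_end = decl.pop("regex_end", None)

        # Wrap every non-list value in a list, except set_state.
        entry = {
            key: value if key == "set_state" or isinstance(value, list) else [value]
            for key, value in decl.items()
        }
        entry.setdefault("state", ["*"])
        entry["name"] = name

        # Look for onRegexName() first, then on_regexName.
        ucfirst = name[:1].upper() + name[1:]
        handler = getattr(grammar, "on" + ucfirst, None) or getattr(
            grammar, "on_" + name, None
        )
        if handler is not None:
            entry["onMatch"] = handler
        normalized.append(entry)

        # Expand regex_end into its own standalone directive.
        if regex_end is not None:
            end = {
                "name": "endofreg:" + name,
                "regex": regex_end if isinstance(regex_end, list) else [regex_end],
                "state": [decl.get("set_state", "*")],
                "pop_state": True,
            }
            end_handler = getattr(grammar, "onend" + ucfirst, None)
            if end_handler is not None:
                end["onMatch"] = end_handler
            normalized.append(end)
    return normalized


class DemoGrammar:
    """Hypothetical grammar with handlers for one directive."""

    def onString_open(self):
        return "open"

    def onendString_open(self):
        return "close"


result = normalize(
    {"string_open": {"regex": '"', "set_state": "string", "regex_end": '"'}},
    DemoGrammar(),
)
```

Here `result` holds two entries: the original string_open directive and a generated endofreg:string_open directive that is only valid in the "string" state and pops it on match.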

Lexing begins

  1. Using a while loop, set $token = $token->next(), which returns itself with an updated buffer, or false if there are no more characters to process
    • next() adds one character to the buffer at a time.
  2. Each valid_regex for the current state is now checked.
    • Warning: No more valid_regexes are checked after the first one matches
  3. The first regex that matches is stored as a success_regex
    • The success_regex contains some or all of:
      • cur_match: the current results of preg_match(), where [0] is the full match, [1] is the first set of parentheses, and so on
      • regex=>[/regex1/, /regex2/]. Only one regex has to match, and processing stops after the first regex is matched
      • state=>[state1, state_2]. The state $lexer must be in for this regex to be checked
      • state_not=>[state3, state_4]. If $lexer->getState() is one of these, do not check this regex. For every other state, check this regex (unless state=> is also given.)
      • onMatch => the function to call when this regex is matched
      • set_state => new_state, to call $lexer->setState('new_state') when this regex is matched
      • buffer.clear => true, to call $token->clearBuffer() when this regex is matched
      • pop_state => true, to call $lexer->popState() and $token->clearBuffer()

  4. Every success_regex now begins processing
  5. If $debug is on (hardcoded right now), then print state information.
  6. set cur_match to $token->setMatch($cur_match)
  7. Call onMatch function, if it was set
  8. Process every other directive
  9. Set $lexer->state to set_state if key present on the directive
  10. Print additional state information if $debug==true
  11. Call onLexerEnd() on each grammar
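
The loop above can be sketched end to end. This is a simplified Python model under stated assumptions, not the lexer's PHP implementation: the `lex` function, the "root" initial state, the example directives, and treating set_state as a push onto the state stack are all hypothetical. It grows the buffer one character at a time, checks the directives valid for the current state, stops at the first match, and then applies onMatch, buffer.clear, pop_state, and set_state.

```python
import re


def lex(text, directives, initial_state="root"):
    """Return (directive name, matched text) pairs in the order they fire."""
    states = [initial_state]
    buffer = ""
    events = []
    for char in text:  # like next(): one character added per loop
        buffer += char
        state = states[-1]
        for d in directives:
            if "*" not in d["state"] and state not in d["state"]:
                continue
            if state in d.get("state_not", []):
                continue
            # Only one regex of the directive has to match.
            match = None
            for pattern in d["regex"]:
                match = re.search(pattern, buffer)
                if match:
                    break
            if match is None:
                continue
            events.append((d["name"], match.group(0)))
            if "onMatch" in d:
                d["onMatch"](match)
            if d.get("buffer.clear"):
                buffer = ""
            if d.get("pop_state"):
                if len(states) > 1:
                    states.pop()
                buffer = ""  # pop_state also clears the buffer
            if "set_state" in d:
                states.append(d["set_state"])  # assumption: setState pushes
            break  # no more directives are checked after the first match
    return events


directives = [
    {
        "name": "docblock_open",
        "regex": [r"/\*\*$"],
        "state": ["*"],
        "state_not": ["docblock"],
        "set_state": "docblock",
        "buffer.clear": True,
    },
    {
        "name": "docblock_close",
        "regex": [r"(.*)\*/$"],
        "state": ["docblock"],
        "pop_state": True,
    },
]

events = lex("class A { /** hi */ }", directives)
```

With this input, docblock_open fires once the buffer ends in /**, and docblock_close fires once the buffer (cleared after the open) ends in */. Because the loop breaks on the first match, the order of the directive list decides which directive wins when several could match.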