Architecture

This is a pretty poor explanation of the architecture, but it's good enough for me. If it's not good enough for you, please submit a PR.

The Lexer manages the directive stack, the AST stack, and overall program execution. The Token manages our position in the input string. The Grammars declare directives, which carry instructions to perform when they're matched. These instructions can modify the token, modify the lexer, and create new Asymmetrical Syntax Trees (ASTs) that form a structured representation of the source. Grammars also declare methods that can be used as additional instructions.

If you want to write a grammar, see docs/GrammarWriting.md (though it might help to understand the architecture first).

Flow of Lexing

  1. The Lexer is initialized.
  2. Grammars are added to the lexer.
  3. An Ast is created to be the root.
    • The Lexer has convenience methods for this, or you can create your own Ast to lex into.
  4. The Ast is set as the head.
  5. A Token is created from the input string.
  6. For each grammar, onLexerStart($lexer, $ast, $token) is called.
  7. The lexing is performed (see below).
  8. For each grammar, onLexerEnd($lexer, $ast, $token) is called.
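
The flow above can be sketched roughly as follows. This is an illustrative Python stand-in, not the library's PHP code: the class names mirror the pieces described here, but every method name (add_grammar, lex, run, on_lexer_start, on_lexer_end) and the RecordingGrammar helper are assumptions made for the example.

```python
class Ast:
    """Hypothetical stand-in for the library's Ast: a typed bag of values."""
    def __init__(self, type_):
        self.type = type_
        self.values = {}


class Token:
    """Hypothetical stand-in: only holds the input string in this sketch."""
    def __init__(self, source):
        self.source = source


class RecordingGrammar:
    """Hypothetical grammar that just records which hooks were called."""
    def __init__(self, name, log):
        self.name = name
        self.log = log

    def on_lexer_start(self, lexer, ast, token):
        self.log.append(f"{self.name}:start")

    def on_lexer_end(self, lexer, ast, token):
        self.log.append(f"{self.name}:end")


class Lexer:
    def __init__(self):                      # step 1: lexer is initialized
        self.grammars = []
        self.ast_stack = []

    def add_grammar(self, grammar):          # step 2: grammars are added
        self.grammars.append(grammar)

    def lex(self, source, root=None):
        if root is None:
            root = Ast("root")               # step 3: create the root Ast
        self.ast_stack = [root]              # step 4: root becomes the head
        token = Token(source)                # step 5: token from the input
        for grammar in self.grammars:        # step 6: start hooks
            grammar.on_lexer_start(self, root, token)
        self.run(token)                      # step 7: the lexing itself
        for grammar in self.grammars:        # step 8: end hooks
            grammar.on_lexer_end(self, root, token)
        return root

    def run(self, token):
        """Placeholder: the lexing loop is left out of this sketch."""
```

With two grammars added, both start hooks run before lexing and both end hooks run after it, in the order the grammars were added.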

The Pieces

  • Lexer: takes Grammars to process an input string using a Token
    • $directiveStack: A multi-layered stack of directives. Each layer can have multiple directives. Each layer has a 'started' and 'unstarted' list.
    • $astStack: The stack of ASTs. Generally, the head AST is the one operated on.
  • Token: Contains the input string. Manages our position in the input string.
  • Grammar: Declares directives
    • directive: A set of targets & instructions. Generally contains a string or regex (target) to match against & instructions for what to do upon that match.
  • Ast: An Asymmetrical Syntax Tree... holds values & can be output as an array
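
One way to picture the layered directive stack is sketched below. This is a Python illustration under assumptions, not the library's actual data structure: the Layer and DirectiveStack classes and their method names are hypothetical, and a directive is reduced to a name.

```python
from dataclasses import dataclass, field


@dataclass
class Directive:
    """Hypothetical, minimal directive: only a name, no targets or instructions."""
    name: str


@dataclass
class Layer:
    """One layer of the stack, with its 'started' and 'unstarted' lists."""
    started: list = field(default_factory=list)
    unstarted: list = field(default_factory=list)


class DirectiveStack:
    def __init__(self):
        self.layers = []

    def push_layer(self, directives):
        # Directives in a new layer begin in the 'unstarted' list.
        self.layers.append(Layer(unstarted=list(directives)))

    def pop_layer(self):
        return self.layers.pop()

    @property
    def head(self):
        # The top layer is the one checked during lexing in this sketch.
        return self.layers[-1]

    def start(self, directive):
        self.head.unstarted.remove(directive)
        self.head.started.append(directive)

    def stop(self, directive):
        self.head.started.remove(directive)
        self.head.unstarted.append(directive)
```

Keeping the two lists per layer means a directive that stops can simply move back to 'unstarted' and be matched again later, without the layer itself being rebuilt.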

The Lexing

  1. Using a while loop, set $token = $token->next(), which returns the token itself with an updated buffer, or false if there are no more characters to process.
    • next() adds one character to the buffer at a time.
  2. Each started directive is checked for match and stop. If there are no started directives, the unstarted directives are checked for start.
  3. If started directives stop, they're moved back into unstarted, and vice versa when unstarted directives start.
  4. Any regexes that passed in step #3 are now processed for their instructions, in the order those instructions were declared.
  5. Any then instructions are processed, and the directives they target are added to a new layer of the directive stack.
  6. Repeat from #1 until the token has been fully buffered.
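
A much simplified version of this loop is sketched below in Python. It is an assumption-heavy illustration rather than the library's implementation: it uses a single layer, leaves out then handling and layered stacks entirely, and replaces declared instructions with two hard-coded behaviours (clear the buffer on start, capture the buffer on stop). The names lex, clear_buffer, and the Directive constructor are hypothetical.

```python
import re


class Token:
    """Holds the input and a buffer that grows one character per next()."""
    def __init__(self, source):
        self.source = source
        self.index = 0
        self.buffer = ""

    def next(self):
        if self.index >= len(self.source):
            return False                  # no more characters to process
        self.buffer += self.source[self.index]
        self.index += 1
        return self                       # same token, updated buffer

    def clear_buffer(self):
        self.buffer = ""


class Directive:
    """Hypothetical directive with one start regex and one stop regex."""
    def __init__(self, name, start, stop):
        self.name = name
        self.start = re.compile(start)
        self.stop = re.compile(stop)


def lex(source, directives):
    started, unstarted = [], list(directives)
    captured = []
    token = Token(source).next()
    while token:
        if started:
            # Started directives are checked for stop.
            for d in list(started):
                match = d.stop.search(token.buffer)
                if match:
                    started.remove(d)
                    unstarted.append(d)
                    # Stand-in instruction: keep the text before the stop match.
                    captured.append((d.name, token.buffer[:match.start()]))
                    token.clear_buffer()
        else:
            # With nothing started, unstarted directives are checked for start.
            for d in list(unstarted):
                if d.start.search(token.buffer):
                    unstarted.remove(d)
                    started.append(d)
                    # Stand-in instruction: drop everything up to the start.
                    token.clear_buffer()
                    break
        token = token.next()
    return captured
```

For example, with a single directive `Directive("string", r'"$', r'"$')`, calling `lex('a "hi" b', [...])` captures the text between the two quotes, because each regex is tested against the buffer as it grows one character at a time.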