Create Language Parser

To parse a new, currently unsupported language, you'll define Directives, written in a simplified programming language. Directives are backed by PHP code, either built into the Lexer or added to your language parser's Grammar.

The goal of parsing code is to create an AST (Abstract Syntax Tree). It's just a multi-dimensional array that details the structure of the parsed code.
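As a rough illustration, the AST for a file containing one class with one method might look like the array below. The key names (type, name, class, methods, docblock) are assumptions chosen for this sketch; the real shape depends on the directives your grammar defines.

```php
<?php
// Hypothetical AST for a file containing `class Cat { function purr() {} }`.
// Key names are illustrative, not the Lexer's guaranteed output.
$ast = [
    'type' => 'file',
    'class' => [
        [
            'type' => 'class',
            'name' => 'Cat',
            'docblock' => null,
            'methods' => [
                ['type' => 'method', 'name' => 'purr'],
            ],
        ],
    ],
];

// Walking the tree is ordinary array access.
echo $ast['class'][0]['methods'][0]['name'], "\n";
```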

(The Lexer could potentially parse things other than code, such as cooking recipes.)

In this file

  • How the Lexer Works
  • Directives & Instruction Sets
  • Instructions
  • Instruction Examples
  • Available Instructions
  • Example

How the Lexer Works

The Lexer creates a Token, an Ast stack, and a Directive stack, then loops over each individual character in the input string, adding one character to the Token's buffer on each loop.

On each loop, the head of the Directive stack is processed, and instructions may modify the head ast, add another directive to the top of the stack, create a new ast, rewind the buffer, or do many other tasks. (Note: Most instruction sets start with a match /regex/ instruction, which must succeed for the remaining instructions to run.)

Each layer of the Directive stack contains an 'unstarted' and a 'started' list (each list is an array of Directives, or an empty array).

Each Directive may contain up to three instruction sets: 'start', 'match', and 'stop'. (@todo We'll talk about the special 'is' instruction set later.)

When the 'started' list is empty, 'unstarted' directives are processed. When an 'unstarted' directive is processed, its 'start' instruction set is executed. If the 'start' instruction set succeeds, the Directive is added to the 'started' list.

When the 'started' list is NOT empty, 'started' directives are processed. When a 'started' directive is processed, its 'match' and 'stop' instruction sets are executed. 'match' goes first. If 'stop' is executed successfully, the Directive is moved back to the 'unstarted' list.

A Directive may add a layer to the directive stack. Then on subsequent loops, the new head directive layer will be processed, and the previous directive layer will be paused, until it is the head layer again.

The Lexer loops in this way, over each character, until all characters are processed.

When the Lexer finishes parsing a string, it returns a detailed AST describing the input file/string.

To recap, Directives and ASTs are both on a stack. Directives are processed on each loop, executing instructions that modify ASTs, create new ASTs, and tell the Lexer which Directives it should run next.

Tip: The Lexer has a 'stop_loop' setting for debugging, which stops it after a given number of loops.
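The loop described in this section can be sketched with a toy model. This is not the real \Tlf\Lexer: toyLex, its $stopLoop parameter, and the simplified directive shape (one 'start' regex and one 'stop' regex per directive, no directive stack layers) are all assumptions made to keep the example short.

```php
<?php
// Toy model of the loop: NOT the real \Tlf\Lexer, just the start/stop idea.
function toyLex(string $input, array $directives, int $stopLoop = -1): array
{
    $buffer = '';
    $unstarted = $directives; // name => ['start' => regex, 'stop' => regex]
    $started = [];
    $ast = ['type' => 'file', 'matches' => []];

    foreach (str_split($input) as $i => $char) {
        if ($stopLoop >= 0 && $i >= $stopLoop) {
            break; // mimics the 'stop_loop' debugging setting
        }
        $buffer .= $char; // one character is added per loop

        if ($started === []) {
            // Nothing started: try each unstarted directive's 'start' regex.
            foreach ($unstarted as $name => $directive) {
                if (preg_match($directive['start'], $buffer) === 1) {
                    $started[$name] = $directive;
                    unset($unstarted[$name]);
                    $buffer = '';
                    break;
                }
            }
            continue;
        }

        // Something started: try each started directive's 'stop' regex.
        foreach ($started as $name => $directive) {
            if (preg_match($directive['stop'], $buffer, $m) === 1) {
                $ast['matches'][] = ['type' => $name, 'value' => $m[1]];
                $unstarted[$name] = $directive; // back to the unstarted list
                unset($started[$name]);
                $buffer = '';
            }
        }
    }
    return $ast;
}

$directives = [
    'string' => ['start' => '/"$/', 'stop' => '/^(.*)"$/s'],
];
$ast = toyLex('say "hi" and "bye"', $directives);
print_r($ast['matches']);
```

Here the 'string' directive starts when the buffer ends in a double quote and stops at the next one, so the two quoted words end up in the toy AST.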

Directives & Instruction Sets

A Directive is a named array of instruction sets. Each instruction set contains an array of instructions (lol).

Most instruction sets begin with a match /regex/, which matches against the Token's current buffer. If the regex matches, an unstarted directive is started, and the instructions after the match instruction are processed. If the regex does not match, the rest of the instruction set is not processed.
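A directive following that description might be written as below. The instruction names (match, buffer.clear, ast.set) come from this document, but the 'string' directive itself, its regexes, and the 'value' property are hypothetical.

```php
<?php
// Hypothetical 'string' directive: a named array of instruction sets.
// Regexes and the 'value' property name are illustrative assumptions.
$directives = [
    'string' => [
        'start' => [
            'match' => '/"$/',   // tested against the Token's current buffer
            'buffer.clear',      // only runs if the match above succeeded
        ],
        'stop' => [
            'match' => '/"$/',
            'ast.set value',     // store the current buffer on the head ast
            'buffer.clear',
        ],
    ],
];
```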

Instructions

There are 20+ instructions available, and you can directly call methods on your Grammar, the Lexer, or the Token.

Instructions can be a string like 'token.rewind 3' or a key/value pair like 'ast.new' => ['type'=>'class','name'=>'_token:buffer' ...], which creates a new ast.

If you define a key/value pair, the key is the instruction (and may contain arguments), and the value is an additional argument to pass to the instruction.
Tip: This allows arrays to be passed to instructions.
Tip: If the value is (strict boolean) false, the instruction is disabled.
Tip: If the key begins with an underscore (_), the instruction is disabled.

The instruction can include arguments, separated by a space. The key may end in a special/reserved argument. The reserved arguments are ..., [], !, and // comment. The value may be any PHP data type, depending on the instruction's requirements & any reserved args that are used.

(Tip: Add comments if you ever have two identical keys in an instruction set.)

If you only define a value (no string key), then the value is your instruction, and reserved arguments are unavailable.
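The two instruction forms and the two ways of disabling an instruction can be sketched as a small filter. activeInstructions is a hypothetical helper written for this example, not part of the Lexer, and the instruction set it filters is made up.

```php
<?php
// Sketch of the rules above; not the Lexer's actual implementation.
function activeInstructions(array $instructionSet): array
{
    $active = [];
    foreach ($instructionSet as $key => $value) {
        if (is_int($key)) {
            // Value-only form: the value is the whole instruction.
            $active[] = ['instruction' => $value, 'value' => null];
            continue;
        }
        // Disabled: strict boolean false value, or key starting with '_'.
        if ($value === false || strncmp($key, '_', 1) === 0) {
            continue;
        }
        $active[] = ['instruction' => $key, 'value' => $value];
    }
    return $active;
}

$set = [
    'match' => '/abc$/',
    'token.rewind 3',
    'ast.new' => ['type' => 'class', 'name' => '_token:buffer'],
    '_buffer.clear' => true, // disabled: key begins with an underscore
    'stop' => false,         // disabled: value is strictly false
];

print_r(array_column(activeInstructions($set), 'instruction'));
```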

Instruction Examples

"instruction a b c" passes three arguments ('a','b','c') to the instruction
"object:method arg1 arg2" calls a PHP object's method, passing two args ('arg1', 'arg2')
"instruction a" => "b b" passes two arguments ('a', 'b b') to the instruction
"instruction a" => ['b', 'c', 'd'] passes two arguments to the instruction ('a', ['b','c','d'])
"instruction a ..." => ['b', 'c', 'd'] passes four arguments ('a','b','c','d') to the instruction.
"instruction []" => 'value' throws an exception because [] is reserved for future use.
"instruction !" => '_object:method arg1 arg2' calls the named PHP object/method, and the return value is passed to the instruction.

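The instruction examples above can be mirrored in a short sketch. resolveInstruction and the $call callback are hypothetical names; the Lexer's real parsing code is not shown in this document and may differ.

```php
<?php
// Sketch only: mirrors the examples above, not the Lexer's real parser.
// $call is a hypothetical callback standing in for running 'object:method args'.
function resolveInstruction($key, $value, callable $call): array
{
    if (is_int($key)) {
        // Value-only form: no reserved arguments are available.
        $parts = explode(' ', $value);
        $name = array_shift($parts);
        return [$name, $parts];
    }

    // Drop a trailing '// comment', then split the key on spaces.
    $key = trim(preg_replace('#\s*//.*$#', '', $key));
    $parts = explode(' ', $key);
    $name = array_shift($parts);
    $last = $parts === [] ? null : $parts[count($parts) - 1];

    if ($last === '[]') {
        throw new InvalidArgumentException("'[]' is reserved for future use");
    }
    if ($last === '...') {
        array_pop($parts);
        return [$name, array_merge($parts, (array) $value)]; // spread the value
    }
    if ($last === '!') {
        array_pop($parts);
        $parts[] = $call(ltrim($value, '_')); // pass the method's return value
        return [$name, $parts];
    }
    $parts[] = $value; // plain key/value: the value is one extra argument
    return [$name, $parts];
}

$call = function (string $target): string {
    return "<result of $target>";
};

var_export(resolveInstruction('instruction a ...', ['b', 'c', 'd'], $call));
```

Each call returns the instruction name and the list of arguments it would receive, which makes it easy to compare the plain, "..." and "!" forms side by side.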
For object:method, you can call any public method on the object, and available objects are:

  • lexer, \Tlf\Lexer
  • token, \Tlf\Lexer\Token
  • ast, \Tlf\Lexer\Ast - the head ast
  • this, \Tlf\Lexer\Grammar - the Grammar attached to the current directive (The Grammar that defines the current directive).
  • any other grammar name. (must be a grammar that's added to your lexer instance)

Non-Grammar methods are called like $object->method(...$args) where $args is what's defined in your instruction, like ['arg1', 'arg2'].

Grammar methods receive ($lexer, $headAst, $token, $directive, $args), where $args is what's defined in your instruction, like ['arg1', 'arg2'].
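The difference between the two calling conventions can be shown with stand-in classes. StubToken and ExampleGrammar are hypothetical and exist only so the sketch runs on its own; a real grammar would presumably extend \Tlf\Lexer\Grammar and receive real Lexer, Ast, Token, and Directive values.

```php
<?php
// Hypothetical stand-in for \Tlf\Lexer\Token, so the sketch runs on its own.
class StubToken
{
    public $rewound = 0;

    public function rewind($numChars)
    {
        $this->rewound += (int) $numChars;
    }
}

// Hypothetical grammar class with one method callable from an instruction.
class ExampleGrammar
{
    // Parameters are left untyped because the exact classes are not shown here.
    public function describe($lexer, $headAst, $token, $directive, $args)
    {
        return 'describe called with: ' . implode(', ', $args);
    }
}

// 'token:rewind 3' -> a non-Grammar method receives the args spread out.
$token = new StubToken();
$args = ['3'];
$token->rewind(...$args);

// 'this:describe arg1 arg2' -> a Grammar method receives the context values
// first and the instruction's args as one array. Nulls stand in for the
// lexer, head ast, and directive.
$grammar = new ExampleGrammar();
$result = $grammar->describe(null, null, $token, null, ['arg1', 'arg2']);
echo $result, "\n";
```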

Available Instructions

  • match /regex/ - Start directive & continue processing instruction set if /regex/ matches the current token buffer
    • or buffer.match
  • then :directive_name - Add the named directive to the directive stack. Creates a new layer on the stack once per loop.
    • or directive.then
    • then grammarname:directive_name - Add a directive of another grammar, such as from the docblock grammar
    • then directive_name.stop - Add the named directive's 'stop' instruction set as a 'start'
    • "then :+new_directive_name" => $directive - Create a new directive to add to the stack, instead of referencing a named directive.
    • "then _blank" => $directive - same as :+
    • "then directive_name" => $directive_overrides - To add a directive, but override parts of it. See Grammar.php getOverriddenDirective().
  • then.pop :directive_name layers_to_pop - Add a directive to the stack & immediately pop the directive layer when it is matched (instead of the directive's normal functioning).
    • layers_to_pop is an int
    • Rewinds by the length of the first capture group from the target directive's match
    • Creates a new Directive stack layer before the pop if it is the first 'then' call in this loop
  • buffer.notin key - Check whether the current buffer matches a string in your grammar's public $notin. If there is no match, the buffer is cleared. (@todo Confirm whether this also starts or stops the directive.)
    • Your grammar defines public $notin = array<string key, array array_of_strings>.
    • buffer.notin key checks !in_array('current_buffer', $grammar->notin['key'])
  • "ast.new" => [] - Create a new ast
    • arg type=>class or _type - a string type or _object:method arg1 arg2 to call on lexer, token, or head ast
    • arg _setHead=true|false - (optional) true to add to top of ast stack. false to not. default is true.
    • arg _class=>PhpClass - (optional) the Ast's php class.
    • arg _setto=>property - (optional) Add the new ast to the current head ast's named property.
    • arg _addto=>property - (optional) Same as _setto
    • arg _setPrevious=>key - (optional) Set the new ast to the 'previous' key.
      • Ex: we create a docblock ast & we _setPrevious=>'docblock', then we encounter a class and retrieve it with $lexer->previous('docblock'), to set the class's docblock.
    • If not _setto or _addto, then if the type is 'class', then the new ast is added to the current head ast's 'class' property.
    • Any other key/value pair - the key is the name of a property on the ast. The value is either the value to set, or it calls an _object:method arg1 arg2 if it is a string starting with an underscore (_). Ex: _token:buffer would set the current buffer string to the property.
  • debug.die or die - Same as debug.print but exits.
  • debug.print or print - Shows what php values were created from your instruction.
  • directive.inherit [:directive.isn] ["match"] or inherit - Run commands of another directive, except for the 'match' instruction.
    • Ex: inherit :string_instructions.stop
    • Include literal string 'match' to enable the 'match' instruction.
    • arg :directive.isn - isn is the instruction set name ('start', 'match', or 'stop').
  • directive.start or start - Mark the current directive as started.
    • You can start a Directive with start instead of match.
  • directive.stop or stop - Mark the current directive as stopped
    • Allows a 'match' instruction set to stop a directive
  • token.rewind [num_chars] or rewind - Rewind the token.
  • token.forward [num_chars] or forward - Move the token forward
  • directive.halt or halt - Halt the current directive, so further instructions in the active instruction set will not be processed. Other instruction sets in the active stack list will be processed.
  • halt.all - Halt all other instruction sets waiting to be processed. To also halt the active instruction set, call 'directive.halt' AFTER 'halt.all'
  • previous.set [key] - Set the current buffer to the 'previous' key/value set, for the given key.
    • Ex: previous.set docblock is used to capture a docblock, then when the next class or function is found, the Directive will call "ast.set docblock !" => _lexer:previous docblock
  • previous.append [key] - Append the current buffer to the 'previous' key/value set, for the given key.
    • "previous.append" => ['statement', 'method_declaration'] also works
  • directive.stop_others - Loop through the list of all other started directives and move them to the unstarted list. Does NOT stop the active directive (the one calling directive.stop_others).
  • directive.pop [num_layers] - Pop layers off the Directive stack.
  • buffer.clear - Clear the buffer
  • buffer.clearNext [num_chars] - Progress the buffer forward [num_chars], but do NOT add those chars to the buffer. May corrupt the token...
  • buffer.appendChar [string] - Append a string to the buffer. May corrupt the token...
  • ast.pop - Remove the head AST from the top of the stack, unless it's the last one.
  • "ast.set [property] !" => '_object:method arg1 arg2' - Set head AST's [property] to $object->method('arg1','arg2')'s return value.
    • ast.set [property] will set the head AST's [property] to the current buffer.
    • Ex: "ast.set docblock !" => '_lexer:previous docblock' will get the docblock from the 'previous' key/value set, and set the 'docblock' property on the head ast.
  • ast.push [property] - Append the current buffer to the given array property on the head ast.
  • "ast.append [property] !" => '_object:method arg1 arg2' - Append to the head AST's [property]. Same as ast.set, except this appends.
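Several of the instructions listed above can be combined, for example to carry a docblock forward to the next class. The fragment below is a sketch: the instruction names are taken from the list, while the directive names, regexes, and property names are assumptions, and it has not been run against the Lexer.

```php
<?php
// Hypothetical directives combining instructions from the list above.
// Regexes and property names are illustrative assumptions.
$directives = [
    'docblock' => [
        'start' => [
            'match' => '/\/\*\*$/',
        ],
        'stop' => [
            'match' => '/\*\/$/',
            'previous.set docblock', // remember the buffer under 'docblock'
            'buffer.clear',
        ],
    ],
    'class' => [
        'start' => [
            'match' => '/class\s+\w+\s*\{$/',
            'ast.new' => [
                'type' => 'class',
                '_addto' => 'class',       // property on the current head ast
                'name' => '_token:buffer', // stores the whole current buffer
            ],
            // Copy the remembered docblock onto the new head ast.
            'ast.set docblock !' => '_lexer:previous docblock',
            'buffer.clear',
        ],
        'stop' => [
            'match' => '/\}$/',
            'ast.pop',
            'buffer.clear',
        ],
    ],
];
```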

Example

// @todo make a better example

['docblock'=>
    [
        'start'=>[
            'match'=>'##',
        ],
        'stop'=>[
            'match'=>'/(^\s*[^\#])/m',
            'rewind 2',
            'this:handleDocblockEnd',
            'buffer.clear',
            // 'forward 2'
        ]
    ]
];