Info I wanna keep around just because
Some questions
These are probably not accurate. But I wanted to keep them around & maybe answer them again.
when is 'start' checked?
When the top 'started' list is empty and 'unstarted' is non-empty
when are 'match' and 'stop' checked?
A directive's 'match' and 'stop' are checked if it begins the loop on the top started stack
'stop' is checked after 'match'
Neither 'match' nor 'stop' is checked on the same loop in which 'start' is checked.
'stop' is checked whether 'match' passes or not
'stop' is propagated upward at the end of lexing (before or after onLexerEnd()? )
How do I customize tree output?
Possibly a custom Ast class. Should be specifiable in ast.new
Changes Prior to v0.5
- set_previous feature
- ast_set feature
- rewind() feature to move the pointer back
- Wrote Sample code & a draft document regarding a new design for lexer. One that moves away from state-based into expectations-based
- Some changes and improvements to lexer & grammar
- Grammar's flow for building regexes is improved
- Lexer has minor changes to its implementation, especially in regard to automatically popping state & clearing buffers.
- PhpGrammar2 handles properties, constants, methods, functions, strings, and some other things.
- Implemented AST caching for files. File to parse checks `sha1_file`. `filemtime` is checked for each grammar. Does not check `lexer`, `ast`, `token`, or the base `Grammar` class files.
- Write some docs. Clean up status notes
- Updated existing grammars to work with refactored
- Refactored lexer & Grammar & added new features.
- Wrote most of an introduction
- Small DocBlockGrammar fix
- Some bug fixes with lexing & state
- (php grammar)Catch namespace, class, method, docblock, and property
- Create Docblock grammar
- Write bash grammar to capture docblocks & function names
- docblock parsing (to get attributes, basically)
- PhpGrammar successfully prototyped & catching docblocks, properties, and methods for PHP
- Ast getTree()
- Generalized Ast
- Refactored tests
- Make its own repo
- Convert current run script to a tlftest
v0.3 architecture
This is out of date. We no longer use a state approach. Instead, each directive uses `then`s to point to the next directives to watch for.
The lexer manages
- `state`: The name of what's being processed at the moment. State is kept on a stack.
  - `state` may be an asterisk (`'*'`) or asterisk in array (`['*']`) to be valid for all states
  - Ex: After `/**` is found, we enter `state=='docblock'`.
    - When `*/` is reached, we `pop` the state & return the parent state. Something like `class_body`, if the docblock was found inside a `class Something { /** docblock here */ }`
  - When the `state` changes from one loop to the next, the list of valid regexes is updated
- `token`: The text we're processing with convenience methods to get the current buffer, add a char to the buffer, etc
- `head`: The `ast` at the top of the stack.
  - The initial ast is always on the bottom of the stack & is the first `head` that is used
  - Grammars append new `ast`s to the stack & can pop them off.
    - Currently, this must be done programmatically in php. There is not a declarative solution.
- `valid_regex_list`: The list of directives to check for the current state
- `success_regex`: The directive whose pattern matched the current buffer
- `grammar`: Directives & methods that allow you to do ast building & parsing with the lexer
- `directive`: A regex to match & instructions about what to do when the regex is matched
  - Formerly, directives were simply called regexes
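The state stack and the per-state directive list described above can be pictured roughly like this. This is only a sketch under assumptions: `StateStackSketch` and `validDirectives()` are made-up names for illustration, and the real lexer may push, pop, and filter differently. Only the `state` key, the `'*'` wildcard, and the `setState()` / `popState()` / `getState()` names come from these notes.

```php
<?php
// Hypothetical sketch: a state stack plus a filter that selects the
// directives valid for the current state. Not the project's Lexer class.

final class StateStackSketch
{
    /** @var string[] states, bottom of the stack first */
    private array $states = [];

    public function setState(string $state): void
    {
        // Assumption: setting a state pushes it onto the stack.
        $this->states[] = $state;
    }

    public function popState(): ?string
    {
        array_pop($this->states);
        return $this->getState(); // the parent state, or null if empty
    }

    public function getState(): ?string
    {
        return $this->states === [] ? null : $this->states[count($this->states) - 1];
    }

    /** Returns the names of the directives valid for the current state. */
    public function validDirectives(array $directives): array
    {
        $state = $this->getState();
        $valid = [];
        foreach ($directives as $name => $directive) {
            // `state` may be '*' or ['*']; a missing `state` is treated as ['*'].
            $allowed = (array) ($directive['state'] ?? ['*']);
            if (in_array('*', $allowed, true) || in_array($state, $allowed, true)) {
                $valid[] = $name;
            }
        }
        return $valid;
    }
}

$directives = [
    'docblock_open'  => ['regex' => ['/\/\*\*$/'], 'state' => ['class_body']],
    'docblock_close' => ['regex' => ['/\*\/$/'],   'state' => ['docblock']],
    'whitespace'     => ['regex' => ['/\s$/'],     'state' => '*'],
];

$stack = new StateStackSketch();
$stack->setState('class_body');
$inClassBody = $stack->validDirectives($directives);
$stack->setState('docblock');
$inDocblock = $stack->validDirectives($directives);
$parent = $stack->popState();
```

Here the valid list is recomputed on demand. The notes say the list is updated when the state changes between loops, which would amount to caching the result of something like `validDirectives()`.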
Setting up the lexer environment
- Lexer is initialized
- Grammars are added to the lexer. Additional work is done during their `__construct`ors
  - Grammar must do additional processing for regex declarations. See Grammar.php for up-to-date implementation info
    - Convert all non-array values to arrays (except `set_state`)
    - check for `onRegexName()` method on the grammar and set it to `onMatch`
      - also checks `on_regexName`
    - Set `state=>['*']` if it wasn't set
    - Set `name=>`, the key that identified this regex entry in the grammar.
  - Grammar supports a `regex_end` feature which is not supported by the Lexer. Each `regex_end` entry is converted into a standalone `regex` with flags:
    - `name=>'endofreg:the_original_regexes_name'`
    - `regex=>` the regex array found at `regex_end`
    - `state=>` Whatever `set_state` is on the original regex
    - `pop_state => true`
    - `onMatch =>` For a regex with name `"string_open"`, Method `onendString_open()` if the method exists on the current grammar
  - the `regex_end` feature is now better used as `'end'=>[/*normal regex declaration*/]` where everything in `end` is copied into its own regex entry
- An `Ast` is created to be the root
  - `lex($filePath)` creates an ast & sets attributes to that ast
  - For `lex($filePath)`, the cache is checked & a new lex is only done if the file or any of the grammars are different from last time it was run (or if `$lexer->useCache` is `false`).
- The ast is set as the `head`
- A `Token` is created from the passed in `string` (file contents in the case of `lex($filePath)`)
- For each grammar `onLexerStart($lexer, $ast, $token)` is called
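The grammar-side normalization listed above might look something like the following. This is a sketch under assumptions, not what Grammar.php actually does: `normalizeDirectives()` is a hypothetical name, and mapping a regex name to a method with `ucfirst()` is a guess based only on the `onendString_open()` example.

```php
<?php
// Hypothetical sketch of directive normalization. See Grammar.php for the
// real implementation; names and details here are illustrative assumptions.

function normalizeDirectives(array $declarations, object $grammar): array
{
    $normalized = [];
    foreach ($declarations as $name => $directive) {
        // Pull out `regex_end` so it can become a standalone directive.
        $end = $directive['regex_end'] ?? null;
        unset($directive['regex_end']);

        // Convert all non-array values to arrays, except set_state.
        foreach ($directive as $key => $value) {
            if ($key !== 'set_state' && !is_array($value)) {
                $directive[$key] = [$value];
            }
        }

        $directive['state'] = $directive['state'] ?? ['*'];
        $directive['name'] = $name;

        // onRegexName() (or on_regexName()) on the grammar becomes onMatch.
        foreach (['on' . ucfirst($name), 'on_' . $name] as $method) {
            if (method_exists($grammar, $method)) {
                $directive['onMatch'] = [$grammar, $method];
                break;
            }
        }
        $normalized[$name] = $directive;

        if ($end !== null) {
            $endDirective = [
                'name'      => 'endofreg:' . $name,
                'regex'     => (array) $end,
                'pop_state' => true,
            ];
            // The end directive is only valid in the state the opener sets.
            if (isset($directive['set_state'])) {
                $endDirective['state'] = (array) $directive['set_state'];
            }
            $endMethod = 'onend' . ucfirst($name);
            if (method_exists($grammar, $endMethod)) {
                $endDirective['onMatch'] = [$grammar, $endMethod];
            }
            $normalized[$endDirective['name']] = $endDirective;
        }
    }
    return $normalized;
}

// A stand-in grammar object with the two handler methods.
$grammar = new class {
    public function onString_open(): void {}
    public function onendString_open(): void {}
};

$result = normalizeDirectives([
    'string_open' => [
        'regex'     => '/"$/',
        'set_state' => 'string',
        'regex_end' => '/"$/',
    ],
], $grammar);
```

After this call, `$result` holds two entries: `string_open`, with its `regex` wrapped in an array and `state` defaulted to `['*']`, and `endofreg:string_open`, which is only valid in the `string` state and pops it.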
lexing begins
- using a while loop, set `$token = $token->next()`, which returns itself with an updated buffer, or `false`, if there are no more characters to process
  - `next()` adds one character to the buffer at a time.
- Each `valid_regex` for the current state is now checked.
  - Warning: No more `valid_regex`es are checked after the first one matches
- The first regex that matches is stored as a `success_regex`
  - The `success_regex` contains some or all of:
    - `cur_match`, the current results of `preg_match`ing, where `[0]` is the full match, `[1]` is the first set of parentheses & so on
    - `regex=>[/regex1/, /regex2/]`. Only one regex has to match & processing stops after the first regex is matched
    - `state=>[state1, state_2]`. The state `$lexer` must be in for this regex to be checked
    - `state_not=>[state3, state_4]`. If `$lexer->getState()` is one of these, do not check this regex. For every other state, check this regex (unless `state=>` is also given.)
    - `onMatch =>` the function to call when this regex is matched
    - `set_state=>new_state`, to call `$lexer->setState('new_state')` when this regex is matched
    - `clear_buffer => true` to call `$token->clearBuffer()` when regex is matched
    - `pop_state => true` to call `$lexer->popState()` and `$token->clearBuffer()`
- Every `success_regex` now begins processing
- if `$debug` is on (hardcoded right now), then print state information.
- set `cur_match` to `$token->setMatch($cur_match)`
- Call `onMatch` function, if it was set
- Process every other directive
- Set `$lexer->state` to `set_state` if key present on the directive
- Print additional state information if `$debug==true`
- Call `onLexerEnd()` on each grammar
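The loop above can be sketched in a few lines. `TokenSketch` and `lexSketch()` are illustrative stand-ins, not the project's `Token` or `Lexer`; the sketch assumes `set_state` pushes onto the state stack, and it leaves out `state_not`, debugging output, and the grammar start/end hooks.

```php
<?php
// Hypothetical sketch of the lexing loop: grow the buffer one character at a
// time, check the directives valid for the current state, and stop checking
// at the first regex that matches.

final class TokenSketch
{
    private string $source;
    private int $index = 0;
    private string $buffer = '';

    public function __construct(string $source)
    {
        $this->source = $source;
    }

    /** Adds one character to the buffer; returns false when input is exhausted. */
    public function next()
    {
        if ($this->index >= strlen($this->source)) {
            return false;
        }
        $this->buffer .= $this->source[$this->index++];
        return $this;
    }

    public function buffer(): string { return $this->buffer; }
    public function clearBuffer(): void { $this->buffer = ''; }
}

function lexSketch(string $source, array $directives): array
{
    $states = ['root'];   // assumption: lexing starts in a 'root' state
    $matched = [];        // [directive name, full match] pairs, in order
    $token = new TokenSketch($source);

    while ($token = $token->next()) {
        $state = end($states);
        foreach ($directives as $name => $directive) {
            $allowed = $directive['state'] ?? ['*'];
            if (!in_array('*', $allowed, true) && !in_array($state, $allowed, true)) {
                continue;
            }
            foreach ($directive['regex'] as $regex) {
                if (preg_match($regex, $token->buffer(), $curMatch) !== 1) {
                    continue;
                }
                $matched[] = [$name, $curMatch[0]];
                if (isset($directive['onMatch'])) { ($directive['onMatch'])($curMatch); }
                if (isset($directive['set_state'])) { $states[] = $directive['set_state']; }
                if (!empty($directive['clear_buffer'])) { $token->clearBuffer(); }
                if (!empty($directive['pop_state'])) { array_pop($states); $token->clearBuffer(); }
                break 2; // no more directives are checked after the first match
            }
        }
    }
    return $matched;
}

$directives = [
    'docblock_open'  => ['regex' => ['/\/\*\*$/'], 'state' => ['root'],
                         'set_state' => 'docblock', 'clear_buffer' => true],
    'docblock_close' => ['regex' => ['/^(.*)\*\/$/s'], 'state' => ['docblock'],
                         'pop_state' => true],
];
$matched = lexSketch('/** hi */', $directives);
```

Because the buffer is cleared when the docblock opens, the closing directive's match covers only the docblock body and the closing `*/`.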