Figuring out how to write a Lexer
Intro
Below is what I typed up while figuring out how to design my lexer. There was a lot more work that went into it than what's capture here. Namely, a very lenghty process of considering an input string & exactly what should happen on every single loop, up to 19 loops (1 character from the input string added on each loop). This whole process was exhausting, exciting, and super challenging.
I didn't really neeeed to make a lexer, because there is existing software that generates ASTs from source code... But I did anyway. I started with a very very simple version (long before this doc was written), thinking... I'll abandon it if it takes more than a couple hours to build... But then I've gone over several iterations as this project has become interesting to me.
Now, I have a functional lexer, and I am working on writing Grammars. PhpGrammar will come first. Currently, I have a partial Json Grammar I used when developing the newest version of the lexer, but it misses several things that JSON allows, because I don't care to develop that grammar further. I'm sure I'll need to add some new features while I'm writing the PHPGrammar, too.
Anyway, nothing below is edited from when I originally wrote it. Hope its interesting. Thanks for reading!
There's still more work before its prime time, but you can checkout my lexer on Gitlab at https://gitlab.com/taeluf/php/lexer
My Process
My goal for a lexer is to proccess a string and produce an AST filled with information about the string. This can be done with massive regex-matching on the full string or matching on chunks of the string, one at a time. The chunks of the string could reasonably be line-by-line. Or each chunk can be a single character. Or a buffer can be maintained and have one character (or line) added at a time, then the buffer is checked.
When a chunk (buffer or character or line) is matched by a directive, additional processing is performed. This processing can be done by the lexer by reading the directive. It can be done by a method the directive is mapped to. Or a combination can be used where the lexer does some work automatically based upon declarations AND a custom method can do additional, more complex work than what is avaialable declaritvely.
Now, how directives can be defined and processed has many options. A single string or regex match could invoke processing of the current chunk, allowing the chunk to be analyzed and put into ASTs as needed.
Whether a directive should be considered on the current chunk can also be determined by many approaches. Is the lexer in the state required by the directive? Does a previous directive require this one to be considered? Does the previous sequence allow this directive? Does the following sequence of characters indicate an end to this directive?
And I've started thinking about sequences. A sequence of characters is a string that can be individually evaluated for AST purposes. For example /** comment */namespace \Tlf\Lexer;
contains 5 sequences. #1 is the docblock-style comment. #2 is namespace
. #3 is
. #4 is \Tlf\Lexer
. #5 is ;
- The docblock sequence may define the following sequence or a sequence before it. For example in
<?php\n/** docblock**/\nnamespace Cats;
, the docblock describes the Code Block. Add an additional\n
after<?php
and now it describesnamespace Cats;
-
namespace
indicates the next thing we allow & expect. We can allow whitespace, comments, and docblocks. We expect a namespace name in the regex/^[a-zA-Z_\\]+$/
but we can't evaulate this until a semicolon is reached or we'll only have a partial namespace name.- This is further complicated by
namespace \Tlf/**bogus*/\/*comment
. Valid segments of the namespace name are([a-zA-Z0-9_]+|\\)
. These valid segments need all to be joined while invalid but allowed segments are removed. Invalid allowed statements are docblocks, comments, and whitespace - So when I encounter
namespace
I begin seeking the namespace name. The namespace name is built from a series of valid segments which can be interrupted by certain other statements. - So more accurately,
^[a-zA-Z0-9_]+$
is a valid segment and must be followed by a divider, a comment, whitespace, or a terminator (semi-colon). Anything allowed bynamespace
is allowed after this segment, except for another one of these segments.
- This is further complicated by
Let's start again & break down the sequences for /**comment*/namespace\ Tlf/*interrupt*/\Lexer /*okay*/ ;
NOTE: The \
BEFORE Tlf
is not actually allowed for ns declaration & has a different functionality & is used for calling within the current namespace..
Terms
- separator: docblock, whitespace, or a comment
- terminator: a semi-colon
-
a-z
regex:^[0-9_a-zA-Z]+$
<- this can ONLY be evaluated when the next breaks the regex
-
/**comment*/
docblock- Expect whatever was already expected by the previous scope/directive. (i.e. whatever is expected after
<?php
)
- Expect whatever was already expected by the previous scope/directive. (i.e. whatever is expected after
-
namespace
begin looking for namespace name.- Expect
\
or a separator- A separator is a docblock, whitespace, or a comment
-
^[0-9_a-zA-Z]+$
will be expected after a separator....
- Expect
-
\
Begin namespace name & append this char to it- Expect a separator or the a-z regex
-
- Expect a separator or the a-z regex
-
Tlf
Append to namespace name- Expect a separator, terminator (
;
), or\
- Expect a separator, terminator (
-
/*interrupt*/
docblock- Expect a separator, terminator, or
\
- Expect a separator, terminator, or
-
\
- append to namespace name- Expect a separator or a-z regex
-
Lexer
- append to namespace name- Expect a separator, terminator, or
\
- Expect a separator, terminator, or
-
- Same expectations as #7
-
/*okay*/
docblock- Same expectations as #8
-
- Same expectations as #9
-
;
terminator- Store the namespace name
- pop state / directive (we are no longer looking for a namespace name, so the last set of expectations need be removed)
What sets of expectations are there above?
- Expect
\
or a separator - Expect a separator or
a-z
regex - Expect a separator, a terminator, or a
\
Every individual thing I can expect is:
-
\
- separator (whitespace, comment, docblock)
-
a-z
regex -
;
What are my expectations BEFORE namespace
? After ;
?
- Expectations before come from the
<?php
directive. - After also comes from the
<?php
directive.
How else can namespace
be limited?
-
namespace
not allowed if there has been another statement before it
An a-z
regex also CANNOT be a keyword
like class
, abstract
, or others
-
namespace
begin looking for namespace name.- Expect
\
or a separator- A separator is a docblock, whitespace, or a comment
-
^[0-9_a-zA-Z]+$
will be expected after a separator....
- Expect
namespace
begin looking for namespace name
- expect a separator or an a-z regex
- if the next non-separator is a \
, then cancel namespace name declaration & return to previous scope. But then I have a statement that needs processing & it needs processing
Okay, so there are two directives when namespace
is found. The namespace\function();
type & the namespace Declaration;
type. They both have to be evaluated continuously until one of them fails validation.
So the first namespace directive:
- active when
namespace
is found at the end of the buffer - de-activated when
;
is found - expect a separator or a-z directive or
;
- When separator found, discard it
- When
;
found,- if namespace name is empty, report error & throw away the namespace. (continue parsing)
- if namespace name non-empty & ends with
\
, report error & throw away the namespace. (continue parsing) - if namespace name non-empty & begins with
\
, discard as its thenamespace\foo()
syntax
- When a-z directive found, append it to namespace name. expect separator,
\
. Inherit;
expectation- When
;
found, bubble up to namespace's;
expectation - When separator found, discard it
- When
\
found, append it to namespacename. Expect separator or a-z regex. Inherit;
expectation- When
;
found, report error & bubble up to namespace's;
expectation - when separator found, discard it
- When a-z regex found, run a-z directive (which will ultimately re-run this directive)
- When
- When
<?php
$directives = [
'namespace'=>[
'begin'=> 'namespace',
'expect'=> ['separator', 'a-z', 'terminator'],
':separator'=>[
'is'=> ['docblock', 'comment', 'whitespace']
],
':a-z directive'=>[
'terminate_on_regex'=> '/[^[a-zA-Z0-9_]$]/';
'regex'=> '/^[a-zA-Z0-9_]+$/',
'onNoLongerMatching' => 'append to namespace',
'onMatch'=> 'append to namespace', //probably use a function here
'expect'=>['separator', 'divider', 'inherit:terminator'],
],
':terminator'=>[
],
':divider' => [
'string'=>'\\'
],
],
'docblock'=>[], //its whole own directive
'comment'=>[], // its whole own directive
'whitespace'=>[], //its whole own directive
];
- When anything else is found, this
namespace
directive is removed from directives being currently evaluated
So there are different ways to do this:
- Evaluate after end-character is reached
- Evaluate in an ongoing basis & create rigid expectations that limit what the final result is
The 2nd namespace directive:
This rule needs to be present to ensure proper functioning of the namespace declarations rule. Without this rule, if someone uses the namespace\go();
syntax, then we will be stuck in a waiting-for-namespac
state... unless the other namespace directive still catches ;
& will just throw away if the rest of its stuff failed.
- active when
namespace
is found at the end of a buffer - Expect a separator or
\
. - When
\
found, expect a separator or[a-zA-Z0-9_]
(code, one char)- To simplify expecting code, since I'm not trying to add that feature yet, I can expect strings
- When code is found, expect a semi-colon or separator
- IF a separator is found, discard it & continue prior expectation
- When semi-colon is found, discard everything.