February 15, 2021 —John Koster
In this article we are going to take a look at implementing a custom language parser in PHP. The language this parser will analyze will be fairly small, and serve a very specific purpose: to parse custom filter expressions from a string input. These type of specific languages are often referred to as domain specific languages, or DSLs; meaning that they are a custom, often light-weight, language that solves a very particular problem and have no broader scope beyond that.
The syntax we will be writing a parser for is a subset of the filtering syntax implemented by Meerkat, a comment system for Statamic websites. The language we will be parsing looks like this in its most basic form:
1where(name, =, 'value')
Each filter is composed of the following components:
where
in the example);name
in the example), if applicable;=
in the example), if applicable;value
in the example), if applicableA filter can have a single input value:
1is:published(true)
Or an arbitrary large amount of input values:
1where(name, =, 'value', 'value2', 'value3', 'value4')
Multiple filters can also be supplied when separated by a pipe (|
) character:
1filter:one(value)|filter:two(value2)|filter:three(value3)
In early versions of Meerkat's implementation of this language syntax, it was simply a wrapper around PHP's explode
function, which it uses to split strings based on the pipe character. This method, however, makes it impossible to supply input similar to the following:
1filter('string value with an embedded | character')
Similar basic implementation details also make it impossible to have string inputs that contain a single quote (the following example uses the \'
quote escape sequence, as that is the target intended behavior):
1filter('string value with an embedded single quote: \' ')
The first step we will take in implementing our language parser is to split the input up into individual characters (we will refer to them as tokens) so they can be analyzed one-by-one. The first reaction might be to reach for str_split
:
1$tokens = str_split('input string');
This approach will generally work well with English input, as that input tends to be one byte per character (it is important to remember that str_split
splits the text input into an array where each element is one byte). Meerkat has many non-English implementations, so this will not work out so well. To account for this, we will use the mb_str_split
function to separate our input string into an array based on the actual byte-length of each individual character:
1$tokens = mb_str_split('input string');
The output of this would be the following:
1array(12) { 2 [0] => 3 string(1) "i" 4 [1] => 5 string(1) "n" 6 [2] => 7 string(1) "p" 8 [3] => 9 string(1) "u"10 [4] =>11 string(1) "t"12 [5] =>13 string(1) " "14 [6] =>15 string(1) "s"16 [7] =>17 string(1) "t"18 [8] =>19 string(1) "r"20 [9] =>21 string(1) "i"22 [10] =>23 string(1) "n"24 [11] =>25 string(1) "g"26}
Great, we now have a way to split our input string into individually analyzable pieces of text. We will now set up some basic scaffolding and work to rebuild our input string by iterating our tokens. In the next code example, we will simply get each character in our tokens list, and add them to a string named $currentSegment
:
1<?php 2 3$tokens = mb_str_split('input string'); 4$inputLength = count($tokens); 5 6$currentSegment = ''; 7 8for ($i = 0; $i < $inputLength; $i++) { 9 $current = $tokens[$i];10 11 $currentSegment .= $current;12}
While incredibly trivial, that code example will be what we continue to build on for the remainder of this article; each incremental section will contain the code from previous sections to reduce the amount of cross-referencing that needs to happen (at the cost of article length). Let's take a step back from writing code, and take another look at what our input will be and get a better understanding of what exactly that is.
Before writing any more code, we need to take a moment to understand exactly what it is we are trying to analyze. Let's consider the following example input:
1where(name, =, 'Alice\'s name')|where(author, =, '\\')|is:published(true)
This input string contains three filter expressions separated by a pipe (|
) character. When the pipe character is used to separate filter expressions we will refer to it as the filter delimiter; when it is used inside a string it will have no special meaning, and continue to be just a pipe character. We will remove these special characters and write each filter expression on their own line:
1where(name, =, 'Alice\'s name')2where(author, =, '\\')3is:published(true)
From the perspective of our parser an individual expression is only made up of two parts: a name, and a list of input input (it will be up to other code to interpret these things and make sense of them). The format of a filter expression is then:
1<FILTER_NAME><LEFT_PARENTHESIS><INPUT_LIST><RIGHT_PARENTHESIS>
From this format now see we have to more special characters to consider: a left parenthesis (
character when not used inside of a string indicates that a filter name has ended, and the input list is starting, and the right parenthesis )
character indicates that the input list has ended. When the (
character is used to start an input list we will refer to it as the input start character, and when the )
character is used to end an input list we will refer to it as the input end character. Neither of these characters will have special meaning when used inside strings. Next, we will look at the input list itself.
The first filter expression has the following as its input list (the input start and end characters have been removed):
1name, =, 'Alice\'s name'
The following is a list of the desired parser output:
1name2=3Alice's name
We can see that each value in our input list is separated by a comma (,
). When a comma is used to separate input values, we will refer to it as the input delimiter character; again, when it is used inside string input, it will have no special meaning. The last input value is a string, and is a little bit more interesting.
String values will start and end with single quotes ('
). When a single quote is used to denote the start or end of a string, it will be referred to as the string delimiter character. String input is a little bit more complicated when it comes to escape sequences (the technique we will utilize to allow strings to contain single quotes, and to account for an edge-case that arises as a side effect of that implementation), and before we get into that, lets take a quick moment to list what we've discovered so far about our input filter expressions:
The parser will only accept strings that begin and end with a single quote, and will not allow for the use of double quotes to denote string input values.
Name | Character | Description |
---|---|---|
Filter Delimiter | ` | ` |
Input Start | ( |
Denotes the start of an input list |
Input End | ) |
Denotes the end of an input list |
Input Delimiter | , |
Separates values within an input list |
String Delimiter | ' |
Denotes the start or end of a string value |
The expression example we've been looking at so far has the string 'Alice\'s name'
as one of its input values, and the desired output is Alice's name
. Because the single quite ('
) character is already being used to denote the start and end of a string, the following would be ambiguous since we cannot state anything about this value with confidence (is Alice
the string's value, and 's name'
is a syntax error, or is it the other way around?):
1'Alice's name'
By convention, we will say that when a single quote is proceeded by a backslash (\
) character within a string, only the single quote will become part of the string's value. When a backslash is used as a way to indicate special output within a string, we will refer to it as the escape character. When a valid character follows the escape character, such as a single quote, we will refer to both the backslash and that next character as an escape sequence.
Later in this article we will run into an implementation issue where it becomes difficult to output string values that contain a single backslash character, or the literal value of a backslash followed by a single quote (without intending it to be an escape sequence):
1\'
To allow for this, we will define the escape sequence \\
(the escape character followed by a literal \
character) to output a single backslash.
Our list of special characters has now grown to this:
Name | Character | Description |
---|---|---|
Filter Delimiter | ` | ` |
Input Start | ( |
Denotes the start of an input list |
Input End | ) |
Denotes the end of an input list |
Input Delimiter | , |
Separates values within an input list |
String Delimiter | ' |
Denotes the start or end of a string value |
Escape Character | \ |
Starts an escape sequence within a string value |
And our list of valid escape sequences is:
Sequence | Desired Output |
---|---|
\' |
' |
\\ |
\ |
We will consider all of these characters special, and unless they are used within a string (and with the correct escape sequence), they will not be available for use in filter names or non-string input values. With this in mind, the following filter name would be invalid:
1in,v|a(lid):name(input, list)
Now that we have a better understanding of our input, it would be a good time to decide what we want the output of our parser to be. In the previous section we decided that, from the perspective of the parser, each individual filter expression is composed of a name and a list of input values. The input may contain multiple filter expressions. There are many different ways we could represent this, but we will use a simple associative array to represent a single filter:
1[2 'name' => 'filter:name',3 'input' => [4 'list',5 'of',6 'input',7 'values'8 ]9]
The final output of the parser would just be an array of arrays with that form.
We will start our parser implementation by defining our characters as constants. We could simply write them as strings each time we need them in our code, but it is easy to make a mistake that way once things get going. We will build on the code that we started the article with:
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split('input string');11$inputLength = count($tokens);12 13$currentSegment = '';14 15for ($i = 0; $i < $inputLength; $i++) {16 $current = $tokens[$i];17 18 $currentSegment .= $current;19}
To start, we are going to change how we are getting our input for the mb_str_split
function just so we can avoid looking at PHP's string escape sequences while we attempt to build our own later. To do this, we will create an input.txt
file alongside our PHP file with the following contents (we are going to use simple input for now to get started):
1where(name, =, value)
We will also update the code to read from this file:
1$tokens = mb_str_split(file_get_contents('input.txt'));
Great, we can now modify our input string without confusing ourselves by mixing our custom string escape sequences and the PHP escape sequences. Now the question is where do we even begin?
We will start simple and attempt to get the filter name separated from the rest of the input string. In the previous section when we described we know that the filter name is the text that appears before the input start character ((
). To extract this, we can keep concatenating the $currentSegment
string until we reach that character:
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split(file_get_contents('input.txt'));11$inputLength = count($tokens);12 13$filterName = '';14$currentSegment = '';15 16for ($i = 0; $i < $inputLength; $i++) {17 $current = $tokens[$i];18 19 if ($current === TOKEN_INPUT_START) {20 $filterName = $currentSegment;21 $currentSegment = '';22 23 continue;24 }25 26 $currentSegment .= $current;27}
After this has executed, the $filterName
variable would contain the value where
, and the $currentSegment
variable would contain the value name, =, value)
. You may have also noticed that once we hit the TOKEN_INPUT_START
character we reset the value of $currentSegment
. We will do this whenever we know what type of data is contained in the $currentSegment
variable (in this instance, it was the filter's name). The next step will be to start parsing the input list.
To parse the input list, we will continue to iterate each of the remaining characters. Once we hit the TOKEN_INPUT_DELIMITER
character we will add the value of the $currentSegment
to a new list we will create. We will also take this time to update our existing code to add a new $isParsingInput
flag we will make use of later:
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split(file_get_contents('input.txt'));11$inputLength = count($tokens);12 13$filterName = '';14$currentSegment = '';15$filterInputs = [];16$isParsingInput = false;17 18for ($i = 0; $i < $inputLength; $i++) {19 $current = $tokens[$i];20 21 if ($current === TOKEN_INPUT_DELIMITER) {22 $filterInputs[] = $currentSegment;23 $currentSegment = '';24 25 continue;26 } else if ($current === TOKEN_INPUT_START) {27 $filterName = $currentSegment;28 $currentSegment = '';29 $isParsingInput = true;30 31 continue;32 }33 34 $currentSegment .= $current;35}
Once this code has finished executing, the $filterInputs
array will contain the following values:
name
=
This isn't quite right, as we also expect the value
input value to be part of that list. It is not currently in the list since our loop has finished iterating the array without taking any additional actions since no conditions were met. We will fix this by adding a new condition that checks for the TOKEN_INPUT_END
token:
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split(file_get_contents('input.txt'));11$inputLength = count($tokens);12 13$filterName = '';14$currentSegment = '';15$filterInputs = [];16$isParsingInput = false;17 18for ($i = 0; $i < $inputLength; $i++) {19 $current = $tokens[$i];20 21 if ($current === TOKEN_INPUT_END) {22 $filterInputs[] = $currentSegment;23 $currentSegment = '';24 $isParsingInput = false;25 26 continue;27 } else if ($current === TOKEN_INPUT_DELIMITER) {28 $filterInputs[] = $currentSegment;29 $currentSegment = '';30 31 continue;32 } else if ($current === TOKEN_INPUT_START) {33 $filterName = $currentSegment;34 $currentSegment = '';35 $isParsingInput = true;36 37 continue;38 }39 40 $currentSegment .= $current;41}
Now once our code executes, the value of $filterName
would be where
and $filterInputs
would contain the following items:
name
=
value
Our next step will be to create the array that will represent a single filter in our input. We know from our input description that an individual filter ends when the TOKEN_INPUT_END
token is detected.
When we detect the TOKEN_INPUT_END
token we will create a new array containing our filter's name and input values (we will also trim all of the input values at this point to remove extra whitespace):
Trimming the whitespace of the input values is not a technical requirement of the parser itself, and is purely a convenience step to clean up the input values. We could have expanded the list of tokens we are analyzing to include whitespace, and handle it in a better way, but it is not a requirement for this particular use case.
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split(file_get_contents('input.txt'));11$inputLength = count($tokens);12 13$filters = [];14$filterName = '';15$currentSegment = '';16$filterInputs = [];17$isParsingInput = false;18 19for ($i = 0; $i < $inputLength; $i++) {20 $current = $tokens[$i];21 22 if ($current === TOKEN_INPUT_END) {23 $filterInputs[] = $currentSegment;24 $currentSegment = '';25 $isParsingInput = false;26 27 $filters[] = [28 'name' => $filterName,29 'input' => array_map('trim', $filterInputs)30 ];31 32 $filterName = '';33 $filterInputs = [];34 35 continue;36 } else if ($current === TOKEN_INPUT_DELIMITER) {37 $filterInputs[] = $currentSegment;38 $currentSegment = '';39 40 continue;41 } else if ($current === TOKEN_INPUT_START) {42 $filterName = $currentSegment;43 $currentSegment = '';44 $isParsingInput = true;45 46 continue;47 }48 49 $currentSegment .= $current;50}
Once our code has executed, our $filters
variable would contain the following value:
1array(1) { 2 [0]=> 3 array(2) { 4 ["name"]=> 5 string(5) "where" 6 ["input"]=> 7 array(3) { 8 [0]=> 9 string(4) "name"10 [1]=>11 string(1) "="12 [2]=>13 string(5) "value"14 }15 }16}
After the previous section, our parser can now extract a single input filter from the input string. Let's update the input.txt
with the following value:
1where(name, =, value)|is:published(true)
We will add a new condition to our parser that checks if the current token is the TOKEN_FILTER_DELIMITER
token; if it is, we will reset our current values and move on without doing anything with the $current
character:
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split(file_get_contents('input.txt'));11$inputLength = count($tokens);12 13$filters = [];14$filterName = '';15$currentSegment = '';16$filterInputs = [];17$isParsingInput = false;18 19for ($i = 0; $i < $inputLength; $i++) {20 $current = $tokens[$i];21 22 if($current === TOKEN_FILTER_DELIMITER) {23 $currentSegment = '';24 $filterName = '';25 $filterInputs = [];26 27 continue;28 } else if ($current === TOKEN_INPUT_END) {29 $filterInputs[] = $currentSegment;30 $currentSegment = '';31 $isParsingInput = false;32 33 $filters[] = [34 'name' => $filterName,35 'input' => array_map('trim', $filterInputs)36 ];37 38 $filterName = '';39 $filterInputs = [];40 41 continue;42 } else if ($current === TOKEN_INPUT_DELIMITER) {43 $filterInputs[] = $currentSegment;44 $currentSegment = '';45 46 continue;47 } else if ($current === TOKEN_INPUT_START) {48 $filterName = $currentSegment;49 $currentSegment = '';50 $isParsingInput = true;51 52 continue;53 }54 55 $currentSegment .= $current;56}
This new conditional branch is very similar to the TOKEN_INPUT_END
conditional block. We will expand on this conditional block in a later section, but for now we will just consider it a way to reset some working values and advance our iterator to the next character to analyze.
Once the above code has executed our $filters
variable would now contain the following values:
1array(2) { 2 [0]=> 3 array(2) { 4 ["name"]=> 5 string(5) "where" 6 ["input"]=> 7 array(3) { 8 [0]=> 9 string(4) "name"10 [1]=>11 string(1) "="12 [2]=>13 string(5) "value"14 }15 }16 [1]=>17 array(2) {18 ["name"]=>19 string(12) "is:published"20 ["input"]=>21 array(1) {22 [0]=>23 string(4) "true"24 }25 }26}
Our next challenge is going to be allowing for string input values. We will update our input.txt
file to contain the following value:
1where(name, =, 'Simple string input')
If we run our parser now we get the following output:
1array(1) { 2 [0]=> 3 array(2) { 4 ["name"]=> 5 string(5) "where" 6 ["input"]=> 7 array(3) { 8 [0]=> 9 string(4) "name"10 [1]=>11 string(1) "="12 [2]=>13 string(21) "'Simple string input'"14 }15 }16}
It almost looks correct the way it is; the problem is that the final input value includes the TOKEN_STR_DELIMITER
characters at the start and end. These characters are not part of the actual input, and are the way we start and end strings with our simple language. We will need to get rid of these in the output.
We will add a new conditional branch that checks for the TOKEN_STR_DELIMITER
, as well as a new flag $isParsingString
that will change depending on if we've already encountered the TOKEN_STR_DELIMITER
character.
If we encounter the TOKEN_STR_DELIMITER
character and the $isParsingString
flag is false
, we know that we are entering into a new string value; the opposite is true: if we encounter the TOKEN_STR_DELIMITER
character and $isParsingString
is true
we can assume that we have just finished parsing a string input value.
Like with the TOKEN_FILTER_DELIMITER
conditional branch we will reset the $currentSegment
value and advance our iterator (we will also not add the TOKEN_STR_DELIMITER
to the $currentSegment
working value). We will also toggle the value of $isParsingString
within this block:
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split(file_get_contents('input.txt'));11$inputLength = count($tokens);12 13$filters = [];14$filterName = '';15$currentSegment = '';16$filterInputs = [];17$isParsingInput = false;18$isParsingString = false;19 20for ($i = 0; $i < $inputLength; $i++) {21 $current = $tokens[$i];22 23 if ($current === TOKEN_STR_DELIMITER) {24 $isParsingString = !$isParsingString;25 $currentSegment = '';26 27 continue;28 } else if($current === TOKEN_FILTER_DELIMITER) {29 $currentSegment = '';30 $filterName = '';31 $filterInputs = [];32 33 continue;34 } else if ($current === TOKEN_INPUT_END) {35 $filterInputs[] = $currentSegment;36 $currentSegment = '';37 $isParsingInput = false;38 39 $filters[] = [40 'name' => $filterName,41 'input' => array_map('trim', $filterInputs)42 ];43 44 $filterName = '';45 $filterInputs = [];46 47 continue;48 } else if ($current === TOKEN_INPUT_DELIMITER) {49 $filterInputs[] = $currentSegment;50 $currentSegment = '';51 52 continue;53 } else if ($current === TOKEN_INPUT_START) {54 $filterName = $currentSegment;55 $currentSegment = '';56 $isParsingInput = true;57 58 continue;59 }60 61 $currentSegment .= $current;62}
If we run this code now, we will get the following output:
1array(1) { 2 [0]=> 3 array(2) { 4 ["name"]=> 5 string(5) "where" 6 ["input"]=> 7 array(3) { 8 [0]=> 9 string(4) "name"10 [1]=>11 string(1) "="12 [2]=>13 string(0) ""14 }15 }16}
Our last input value has been completely removed now. This is interesting, but has a pretty simple cause and solution. When our TOKEN_STR_DELIMITER
token is encountered it resets the value of $currentSegment
. We need to update this code block to check if we are already parsing a string. If so, we will add the current value of $currentSegment
to our $filterInputs
before resetting things and moving on:
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split(file_get_contents('input.txt'));11$inputLength = count($tokens);12 13$filters = [];14$filterName = '';15$currentSegment = '';16$filterInputs = [];17$isParsingInput = false;18$isParsingString = false;19 20for ($i = 0; $i < $inputLength; $i++) {21 $current = $tokens[$i];22 23 if ($current === TOKEN_STR_DELIMITER) {24 if ($isParsingString) {25 $filterInputs[] = $currentSegment;26 }27 28 $isParsingString = !$isParsingString;29 $currentSegment = '';30 31 continue;32 } else if($current === TOKEN_FILTER_DELIMITER) {33 $currentSegment = '';34 $filterName = '';35 $filterInputs = [];36 37 continue;38 } else if ($current === TOKEN_INPUT_END) {39 $filterInputs[] = $currentSegment;40 $currentSegment = '';41 $isParsingInput = false;42 43 $filters[] = [44 'name' => $filterName,45 'input' => array_map('trim', $filterInputs)46 ];47 48 $filterName = '';49 $filterInputs = [];50 51 continue;52 } else if ($current === TOKEN_INPUT_DELIMITER) {53 $filterInputs[] = $currentSegment;54 $currentSegment = '';55 56 continue;57 } else if ($current === TOKEN_INPUT_START) {58 $filterName = $currentSegment;59 $currentSegment = '';60 $isParsingInput = true;61 62 continue;63 }64 65 $currentSegment .= $current;66}
Our parser output is now:
1array(1) { 2 [0]=> 3 array(2) { 4 ["name"]=> 5 string(5) "where" 6 ["input"]=> 7 array(4) { 8 [0]=> 9 string(4) "name"10 [1]=>11 string(1) "="12 [2]=>13 string(19) "Simple string input"14 [3]=>15 string(0) ""16 }17 }18}
We now have a new problem in that we have an empty string value as a fourth input value that shouldn't be there. We cannot simply check if we have empty strings and remove them, since an empty string value can be completely valid input, and we have no way of making a distinction between valid and invalid empty strings. Why is this happening, though?
Currently our TOKEN_STR_DELIMITER
code block is adding the $currentSegment
value to our $filterInputs
array whenever it is reached and we are already parsing a string value. If we take another look at our input string we can see that the TOKEN_INPUT_END
character follows our TOKEN_STR_DELIMITER
character:
1where(name, =, 'Simple string input')
If we think about this some more, in an earlier section we implemented the TOKEN_INPUT_END
conditional block, which also adds the $currentSegment
value to our $filterInputs
array. This behavior would also be present if another string input value was included in the list after a string because the TOKEN_INPUT_DELIMITER
token would be present:
1where(name, =, 'Simple string input', value)
Because we know that only the TOKEN_INPUT_DELIMITER
and TOKEN_INPUT_END
characters should follow the end of a string delimiter, we can update our TOKEN_STR_DELIMITER
block to simply indicate that we are no longer parsing a string and move on (since one of the following conditional blocks would take care of adding the value to the $filterInputs
array):
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split(file_get_contents('input.txt'));11$inputLength = count($tokens);12 13$filters = [];14$filterName = '';15$currentSegment = '';16$filterInputs = [];17$isParsingInput = false;18$isParsingString = false;19 20for ($i = 0; $i < $inputLength; $i++) {21 $current = $tokens[$i];22 23 if ($current === TOKEN_STR_DELIMITER) {24 if ($isParsingString) {25 $isParsingString = false;26 27 continue;28 }29 30 $isParsingString = true;31 $currentSegment = '';32 33 continue;34 } else if($current === TOKEN_FILTER_DELIMITER) {35 $currentSegment = '';36 $filterName = '';37 $filterInputs = [];38 39 continue;40 } else if ($current === TOKEN_INPUT_END) {41 $filterInputs[] = $currentSegment;42 $currentSegment = '';43 $isParsingInput = false;44 45 $filters[] = [46 'name' => $filterName,47 'input' => array_map('trim', $filterInputs)48 ];49 50 $filterName = '';51 $filterInputs = [];52 53 continue;54 } else if ($current === TOKEN_INPUT_DELIMITER) {55 $filterInputs[] = $currentSegment;56 $currentSegment = '';57 58 continue;59 } else if ($current === TOKEN_INPUT_START) {60 $filterName = $currentSegment;61 $currentSegment = '';62 $isParsingInput = true;63 64 continue;65 }66 67 $currentSegment .= $current;68}
The output of our parser is now what we would expect:
1array(1) { 2 [0]=> 3 array(2) { 4 ["name"]=> 5 string(5) "where" 6 ["input"]=> 7 array(3) { 8 [0]=> 9 string(4) "name"10 [1]=>11 string(1) "="12 [2]=>13 string(19) "Simple string input"14 }15 }16}
In this section we will work on implementing our string escape sequences. We will start by adjusting our input.txt
to have the following content:
1where(name, =, 'Singe quote: \'')
The intended output for the final input value is Single quote: '
. In an earlier section we decided that our string escape character would be a backslash character (\
). When the escape character is encountered we need to check the character that follows, and if it is either a single quote ('
) or another backslash (\
), we will output those characters in the output and move on (we could also do this by detecting a backslash or single quote character and checking what the previous character was).
To do this, we will implement a simple mechanism that provides us access to the next character, or the "peek" character:
1$tokens = mb_str_split(file_get_contents('input.txt')); 2$inputLength = count($tokens); 3 4for ($i = 0; $i < $inputLength; $i++) { 5 $current = $tokens[$i]; 6 $next = null; 7 8 if (($i + 1) < $inputLength) { 9 $next = $tokens[$i + 1];10 }11 12 // ...13}
Now that we have access to our peek character, we can add a new conditional branch to detect the TOKEN_STR_ESCAPE
character. We will create a list of all valid characters that can follow the string escape sequence and check against that, as well as check if we are currently parsing a string:
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split(file_get_contents('input.txt'));11$inputLength = count($tokens);12 13$filters = [];14$filterName = '';15$currentSegment = '';16$filterInputs = [];17$isParsingInput = false;18$isParsingString = false;19 20$validEscapeCharacters = ['\'', '\\'];21 22for ($i = 0; $i < $inputLength; $i++) {23 $current = $tokens[$i];24 $next = null;25 26 if (($i + 1) < $inputLength) {27 $next = $tokens[$i + 1];28 }29 30 if ($current === TOKEN_STR_ESCAPE) {31 if ($isParsingString && $next !== null) {32 if (in_array($next, $validEscapeCharacters)) {33 $currentSegment .= $next;34 $i++;35 continue;36 }37 }38 } else if ($current === TOKEN_STR_DELIMITER) {39 if ($isParsingString) {40 $isParsingString = false;41 42 continue;43 }44 45 $isParsingString = true;46 $currentSegment = '';47 48 continue;49 } else if($current === TOKEN_FILTER_DELIMITER) {50 $currentSegment = '';51 $filterName = '';52 $filterInputs = [];53 54 continue;55 } else if ($current === TOKEN_INPUT_END) {56 $filterInputs[] = $currentSegment;57 $currentSegment = '';58 $isParsingInput = false;59 60 $filters[] = [61 'name' => $filterName,62 'input' => array_map('trim', $filterInputs)63 ];64 65 $filterName = '';66 $filterInputs = [];67 68 continue;69 } else if ($current === TOKEN_INPUT_DELIMITER) {70 $filterInputs[] = $currentSegment;71 $currentSegment = '';72 73 continue;74 } else if ($current === TOKEN_INPUT_START) {75 $filterName = $currentSegment;76 $currentSegment = '';77 $isParsingInput = true;78 79 continue;80 }81 82 $currentSegment .= $current;83}
In our implementation of the TOKEN_STR_ESCAPE
condition branch, you may have noticed that we are adding the $next
character to the $currentSegment
instead of the escape character. This is because the escape character itself is not interesting, only what comes after if followed by a valid character. Additionally, we have added the $i++
to that block before continuing to advance our position by one in the list of input tokens.
If we did not add the $i++
to our TOKEN_STR_ESCAPE
block we would see our desired characters doubled in the output since the next iteration would re-process that character. Running the parser now produces the following, desired, output:
1array(1) { 2 [0]=> 3 array(2) { 4 ["name"]=> 5 string(5) "where" 6 ["input"]=> 7 array(3) { 8 [0]=> 9 string(4) "name"10 [1]=>11 string(1) "="12 [2]=>13 string(14) "Singe quote: '"14 }15 }16}
At the moment we've constructed a simple parser for our filter input language, and works relatively well. However, there are a few improvements we need to make regarding the string implementation. To get started, let's update the value of our input.txt
file to contain the following:
1where(name, =, 'Singe (, | quote: \'')
Running our parser now produces the following output:
1array(1) { 2 [0]=> 3 array(2) { 4 ["name"]=> 5 string(0) "" 6 ["input"]=> 7 array(1) { 8 [0]=> 9 string(8) "quote: '"10 }11 }12}
The output is not what we would expect; we've lost our filter name as well as all meaningful input values. The reason for this is that we've added the (
, ,
, and |
characters to our string input.
Previously, before we implemented our string parser, we had created conditional blocks that check for those characters since they have special meaning in our filter input language. To solve our current issue we simply have to adjust those conditional blocks and add a check to see if we are currently parsing a string. If we are parsing a string, we will add that character to the output; if we are not parsing a string, we will continue with the previous behavior:
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split(file_get_contents('input.txt')); 11$inputLength = count($tokens); 12 13$filters = []; 14$filterName = ''; 15$currentSegment = ''; 16$filterInputs = []; 17$isParsingInput = false; 18$isParsingString = false; 19 20$validEscapeCharacters = ['\'', '\\']; 21 22for ($i = 0; $i < $inputLength; $i++) { 23 $current = $tokens[$i]; 24 $next = null; 25 26 if (($i + 1) < $inputLength) { 27 $next = $tokens[$i + 1]; 28 } 29 30 if ($current === TOKEN_STR_ESCAPE) { 31 if ($isParsingString && $next !== null) { 32 if (in_array($next, $validEscapeCharacters)) { 33 $currentSegment .= $next; 34 $i++; 35 continue; 36 } 37 } 38 } else if ($current === TOKEN_STR_DELIMITER) { 39 if ($isParsingString) { 40 $isParsingString = false; 41 42 continue; 43 } 44 45 $isParsingString = true; 46 $currentSegment = ''; 47 48 continue; 49 } else if($current === TOKEN_FILTER_DELIMITER) { 50 if ($isParsingString) { 51 $currentSegment .= $current; 52 continue; 53 } 54 55 $currentSegment = ''; 56 $filterName = ''; 57 $filterInputs = []; 58 59 continue; 60 } else if ($current === TOKEN_INPUT_END) { 61 if ($isParsingString) { 62 $currentSegment .= $current; 63 continue; 64 } 65 66 $filterInputs[] = $currentSegment; 67 $currentSegment = ''; 68 $isParsingInput = false; 69 70 $filters[] = [ 71 'name' => $filterName, 72 'input' => array_map('trim', $filterInputs) 73 ]; 74 75 $filterName = ''; 76 $filterInputs = []; 77 78 continue; 79 } else if ($current === TOKEN_INPUT_DELIMITER) { 80 if ($isParsingString) { 81 $currentSegment .= $current; 82 continue; 83 } 84 85 $filterInputs[] = $currentSegment; 86 $currentSegment = ''; 87 88 continue; 89 } else if ($current === TOKEN_INPUT_START) { 90 if ($isParsingString) { 91 $currentSegment .= $current; 92 continue; 93 } 94 95 $filterName = $currentSegment; 96 $currentSegment = ''; 97 $isParsingInput = true; 98 99 continue;100 }101 102 $currentSegment .= $current;103}
After this change our parser now produces the expected output:
1array(1) { 2 [0]=> 3 array(2) { 4 ["name"]=> 5 string(5) "where" 6 ["input"]=> 7 array(3) { 8 [0]=> 9 string(4) "name"10 [1]=>11 string(1) "="12 [2]=>13 string(19) "Singe (, | quote: '"14 }15 }16}
Our parser can now understand our simple filter language, and produces the output that we would expect. However, we are able to say it produces the output we expect because we know exactly what the input to the program should be; new users of our filter language are unlikely to be as familiar with it as we are, and are likely to become frustrated when the parser produces inconsistent or unexpected output. To help with this, we will now add some error checking and error messages to our parser.
To start, we will go through and list possible error scenarios. Once we have constructed the list, we will implement the corresponding checks in the code and throw exceptions when an error condition is met. We will not be doing anything with the output of our parser, so all of our error conditions will be a result of encountering unexpected input.
The types of errors we will be reporting to users will generally fall into one of two categories:
The following are examples of possible error scenarios (it is not an exhaustive list):
TOKEN_INPUT_START
is encountered, but we never see a TOKEN_INPUT_END
,TOKEN_INPUT_END
, and have not seen TOKEN_INPUT_START
,TOKEN_INPUT_DELIMITER
, but we are not currently parsing an input list,TOKEN_FILTER_DELIMITER
, but the previous token was not TOKEN_INPUT_END
Some error conditions will be easier to detect if we also have access to the previous character, in addition to the next character, of the input. These types of error conditions include scenarios where the input string begins with an illegal character (any of our delimiter tokens), or when an unexpected character follows a special token (string escape sequences could have also been implemented by checking with the previous character, for example). The following example shows how one might implement some basic error messages for this simple type of parser:
1<?php 2 3const TOKEN_FILTER_DELIMITER = '|'; 4const TOKEN_INPUT_START = '('; 5const TOKEN_INPUT_END = ')'; 6const TOKEN_INPUT_DELIMITER = ','; 7const TOKEN_STR_DELIMITER = '\''; 8const TOKEN_STR_ESCAPE = '\\'; 9 10$tokens = mb_str_split(file_get_contents('input.txt')); 11$inputLength = count($tokens); 12 13$filters = []; 14$filterName = ''; 15$currentSegment = ''; 16$filterInputs = []; 17$isParsingInput = false; 18$isParsingString = false; 19 20$validEscapeCharacters = ['\'', '\\']; 21$stringStartPosition = -1; 22$inputListStartPosition = -1; 23 24for ($i = 0; $i < $inputLength; $i++) { 25 $current = $tokens[$i]; 26 $next = null; 27 $previous = null; 28 29 if ($i > 0) { 30 $previous = $tokens[$i - 1]; 31 } 32 33 if (($i + 1) < $inputLength) { 34 $next = $tokens[$i + 1]; 35 } 36 37 if ($current === TOKEN_FILTER_DELIMITER) { 38 if ($previous === null) { 39 throw new Exception('Cannot start expression with TOKEN_FILTER_DELIMITER'); 40 } else if ($previous !== TOKEN_INPUT_END && !$isParsingString) { 41 throw new Exception('Unexpected TOKEN_FILTER_DELIMITER at character: ' . ($i + 1)); 42 } 43 } else if ($current === TOKEN_INPUT_END && $previous === null) { 44 throw new Exception('Cannot start expression with TOKEN_INPUT_END'); 45 } else if ($current === TOKEN_INPUT_START && $previous === null) { 46 throw new Exception('Cannot start expression with TOKEN_INPUT_START'); 47 } else if ($current === TOKEN_STR_DELIMITER && $previous === null) { 48 throw new Exception('Cannot start expression with TOKEN_STR_DELIMITER'); 49 } else if ($current === TOKEN_INPUT_DELIMITER && $previous === null) { 50 throw new Exception('Cannot start expression with TOKEN_INPUT_DELIMITER'); 51 } 52 53 if ($current === TOKEN_INPUT_START && !$isParsingString && $isParsingInput) { 54 throw new Exception('Unexpected TOKEN_INPUT_START at character: ' . ($i + 1)); 55 } else if ($current === TOKEN_INPUT_END && !$isParsingInput) { 56 throw new Exception('Unexpected TOKEN_INPUT_END at character: '.($i + 1)); 57 } 58 59 if ($next === null) { 60 if ($current !== TOKEN_INPUT_END && $previous !== TOKEN_INPUT_END) { 61 throw new Exception('Reached end of input string without input list.'); 62 } 63 64 if ($isParsingInput && $current !== TOKEN_INPUT_END) { 65 throw new Exception('Unexpected end of input while parsing input list. ' . 66 'Input list started at character: ' . $inputListStartPosition); 67 } else if ($isParsingString) { 68 throw new Exception('Unexpected end of input while parsing string. ' . 69 'String started at character: ' . $stringStartPosition); 70 } 71 } 72 73 if ($current === TOKEN_STR_ESCAPE) { 74 if ($isParsingString && $next !== null) { 75 if (in_array($next, $validEscapeCharacters)) { 76 // Handles case of: where(property, =, '\' 77 if (($i + 2) >= $inputLength) { 78 throw new Exception('Unexpected end of input while parsing string. ' . 79 'String started at character: ' . $stringStartPosition); 80 } 81 82 $currentSegment .= $next; 83 $i++; 84 continue; 85 } else { 86 throw new Exception('Invalid string escape sequence at ' 87 . ($i + 1) . ' ("\\' . $next . '").'); 88 } 89 } 90 } else if ($current === TOKEN_STR_DELIMITER) { 91 if ($isParsingString) { 92 $isParsingString = false; 93 $stringStartPosition = -1; 94 95 continue; 96 } 97 98 $isParsingString = true; 99 $stringStartPosition = $i + 1;100 $currentSegment = '';101 102 continue;103 } else if ($current === TOKEN_FILTER_DELIMITER) {104 if ($isParsingString) {105 $currentSegment .= $current;106 continue;107 }108 109 if ($next === null) {110 throw new Exception('Unexpected end of input. Expecting new filter.');111 }112 113 $currentSegment = '';114 $filterName = '';115 $filterInputs = [];116 117 continue;118 } else if ($current === TOKEN_INPUT_END) {119 if ($isParsingString) {120 $currentSegment .= $current;121 continue;122 }123 124 $filterInputs[] = $currentSegment;125 $currentSegment = '';126 $isParsingInput = false;127 $inputListStartPosition = -1;128 129 $filters[] = [130 'name' => $filterName,131 'input' => array_map('trim', $filterInputs)132 ];133 134 $filterName = '';135 $filterInputs = [];136 137 continue;138 } else if ($current === TOKEN_INPUT_DELIMITER) {139 if ($isParsingString) {140 $currentSegment .= $current;141 continue;142 }143 144 if ($isParsingInput === false) {145 throw new Exception('Unexpected TOKEN_INPUT_DELIMITER ' .146 'outside of input list at character: ' . ($i + 1));147 }148 149 $filterInputs[] = $currentSegment;150 $currentSegment = '';151 152 continue;153 } else if ($current === TOKEN_INPUT_START) {154 if ($isParsingString) {155 $currentSegment .= $current;156 continue;157 }158 159 $filterName = $currentSegment;160 $currentSegment = '';161 $isParsingInput = true;162 $inputListStartPosition = $i + 1;163 164 continue;165 }166 167 $currentSegment .= $current;168}
We started this article with a simple description of a customer lan gauge to help describe filters. Throughout the remainder of the article, we built on basic concepts to not only parse our custom syntax, but also handle some intermediate concepts such as string parsing, including escape sequences. The parser we implemented is a fairly basic parser, but can be expanded on to account for much more interesting scenarios, most of which are well beyond the scope of this article, but may come as follow-ups at a later date.
∎