Implementing a Customer Domain Specific Language Parser in PHP

PHP
36 min read

In this article we are going to take a look at implementing a custom language parser in PHP. The language this parser will analyze will be fairly small, and serve a very specific purpose: to parse custom filter expressions from a string input. These type of specific languages are often referred to as domain specific languages, or DSLs; meaning that they are a custom, often light-weight, language that solves a very particular problem and have no broader scope beyond that.

The syntax we will be writing a parser for is a subset of the filtering syntax implemented by Meerkat, a comment system for Statamic 3 websites. The language we will be parsing looks like this in its most basic form:

where(name, =, 'value')

Each filter is composed of the following components:

  • A filter name (where in the example);
  • A target property (name in the example), if applicable;
  • An optional operand (= in the example), if applicable;
  • Filter input values (value in the example), if applicable

A filter can have a single input value:

is:published(true)

Or an arbitrary large amount of input values:

where(name, =, 'value', 'value2', 'value3', 'value4')

Multiple filters can also be supplied when separated by a pipe (|) character:

filter:one(value)|filter:two(value2)|filter:three(value3)

In early versions of Meerkat's implementation of this language syntax, it was simply a wrapper around PHP's explode function, which it uses to split strings based on the pipe character. This method, however, makes it impossible to supply input similar to the following:

filter('string value with an embedded | character')

Similar basic implementation details also make it impossible to have string inputs that contain a single quote (the following example uses the \' quote escape sequence, as that is the target intended behavior):

filter('string value with an embedded single quote: \' ')

Splitting Our Input String Into Tokens

The first step we will take in implementing our language parser is to split the input up into individual characters (we will refer to them as tokens) so they can be analyzed one-by-one. The first reaction might be to reach for str_split:

$tokens = str_split('input string');

This approach will generally work well with English input, as that input tends to be one byte per character (it is important to remember that str_split splits the text input into an array where each element is one byte). Meerkat has many non-English implementations, so this will not work out so well. To account for this, we will use the mb_str_split function to separate our input string into an array based on the actual byte-length of each individual character:

$tokens = mb_str_split('input string');

The output of this would be the following:

array(12) {
  [0] =>
  string(1) "i"
  [1] =>
  string(1) "n"
  [2] =>
  string(1) "p"
  [3] =>
  string(1) "u"
  [4] =>
  string(1) "t"
  [5] =>
  string(1) " "
  [6] =>
  string(1) "s"
  [7] =>
  string(1) "t"
  [8] =>
  string(1) "r"
  [9] =>
  string(1) "i"
  [10] =>
  string(1) "n"
  [11] =>
  string(1) "g"
}

Great, we now have a way to split our input string into individually analyzable pieces of text. We will now set up some basic scaffolding and work to rebuild our input string by iterating our tokens. In the next code example, we will simply get each character in our tokens list, and add them to a string named $currentSegment:

<?php

$tokens = mb_str_split('input string');
$inputLength = count($tokens);

$currentSegment = '';

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];

    $currentSegment .= $current;
}

While incredibly trivial, that code example will be what we continue to build on for the remainder of this article; each incremental section will contain the code from previous sections to reduce the amount of cross-referencing that needs to happen (at the cost of article length). Let's take a step back from writing code, and take another look at what our input will be and get a better understanding of what exactly that is.

Describing Our Language Input

Before writing any more code, we need to take a moment to understand exactly what it is we are trying to analyze. Let's consider the following example input:

where(name, =, 'Alice\'s name')|where(author, =, '\\')|is:published(true)

This input string contains three filter expressions separated by a pipe (|) character. When the pipe character is used to separate filter expressions we will refer to it as the filter delimiter; when it is used inside a string it will have no special meaning, and continue to be just a pipe character. We will remove these special characters and write each filter expression on their own line:

where(name, =, 'Alice\'s name')
where(author, =, '\\')
is:published(true)

From the perspective of our parser an individual expression is only made up of two parts: a name, and a list of input input (it will be up to other code to interpret these things and make sense of them). The format of a filter expression is then:

<FILTER_NAME><LEFT_PARENTHESIS><INPUT_LIST><RIGHT_PARENTHESIS>

From this format now see we have to more special characters to consider: a left parenthesis ( character when not used inside of a string indicates that a filter name has ended, and the input list is starting, and the right parenthesis ) character indicates that the input list has ended. When the ( character is used to start an input list we will refer to it as the input start character, and when the ) character is used to end an input list we will refer to it as the input end character. Neither of these characters will have special meaning when used inside strings. Next, we will look at the input list itself.

The first filter expression has the following as its input list (the input start and end characters have been removed):

name, =, 'Alice\'s name'

The following is a list of the desired parser output:

name
=
Alice's name

We can see that each value in our input list is separated by a comma (,). When a comma is used to separate input values, we will refer to it as the input delimiter character; again, when it is used inside string input, it will have no special meaning. The last input value is a string, and is a little bit more interesting.

String values will start and end with single quotes ('). When a single quote is used to denote the start or end of a string, it will be referred to as the string delimiter character. String input is a little bit more complicated when it comes to escape sequences (the technique we will utilize to allow strings to contain single quotes, and to account for an edge-case that arises as a side effect of that implementation), and before we get into that, lets take a quick moment to list what we've discovered so far about our input filter expressions:

The parser will only accept strings that begin and end with a single quote, and will not allow for the use of double quotes to denote string input values.

Name Character Description
Filter Delimiter ` `
Input Start ( Denotes the start of an input list
Input End ) Denotes the end of an input list
Input Delimiter , Separates values within an input list
String Delimiter ' Denotes the start or end of a string value

The expression example we've been looking at so far has the string 'Alice\'s name' as one of its input values, and the desired output is Alice's name. Because the single quite (') character is already being used to denote the start and end of a string, the following would be ambiguous since we cannot state anything about this value with confidence (is Alice the string's value, and 's name' is a syntax error, or is it the other way around?):

'Alice's name'

By convention, we will say that when a single quote is proceeded by a backslash (\) character within a string, only the single quote will become part of the string's value. When a backslash is used as a way to indicate special output within a string, we will refer to it as the escape character. When a valid character follows the escape character, such as a single quote, we will refer to both the backslash and that next character as an escape sequence.

Later in this article we will run into an implementation issue where it becomes difficult to output string values that contain a single backslash character, or the literal value of a backslash followed by a single quote (without intending it to be an escape sequence):

\'

To allow for this, we will define the escape sequence \\ (the escape character followed by a literal \ character) to output a single backslash.

Our list of special characters has now grown to this:

Name Character Description
Filter Delimiter ` `
Input Start ( Denotes the start of an input list
Input End ) Denotes the end of an input list
Input Delimiter , Separates values within an input list
String Delimiter ' Denotes the start or end of a string value
Escape Character \ Starts an escape sequence within a string value

And our list of valid escape sequences is:

Sequence Desired Output
\' '
\\ \

We will consider all of these characters special, and unless they are used within a string (and with the correct escape sequence), they will not be available for use in filter names or non-string input values. With this in mind, the following filter name would be invalid:

in,v|a(lid):name(input, list)

Defining Our Output

Now that we have a better understanding of our input, it would be a good time to decide what we want the output of our parser to be. In the previous section we decided that, from the perspective of the parser, each individual filter expression is composed of a name and a list of input values. The input may contain multiple filter expressions. There are many different ways we could represent this, but we will use a simple associative array to represent a single filter:

[
    'name' => 'filter:name',
    'input' => [
        'list',
        'of',
        'input',
        'values'
    ]
]

The final output of the parser would just be an array of arrays with that form.

Parsing a Single Input Filter

We will start our parser implementation by defining our characters as constants. We could simply write them as strings each time we need them in our code, but it is easy to make a mistake that way once things get going. We will build on the code that we started the article with:

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split('input string');
$inputLength = count($tokens);

$currentSegment = '';

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];

    $currentSegment .= $current;
}

To start, we are going to change how we are getting our input for the mb_str_split function just so we can avoid looking at PHP's string escape sequences while we attempt to build our own later. To do this, we will create an input.txt file alongside our PHP file with the following contents (we are going to use simple input for now to get started):

where(name, =, value)

We will also update the code to read from this file:

$tokens = mb_str_split(file_get_contents('input.txt'));

Great, we can now modify our input string without confusing ourselves by mixing our custom string escape sequences and the PHP escape sequences. Now the question is where do we even begin?

We will start simple and attempt to get the filter name separated from the rest of the input string. In the previous section when we described we know that the filter name is the text that appears before the input start character ((). To extract this, we can keep concatenating the $currentSegment string until we reach that character:

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

$filterName = '';
$currentSegment = '';

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];

    if ($current === TOKEN_INPUT_START) {
        $filterName = $currentSegment;
        $currentSegment = '';

        continue;
    }

    $currentSegment .= $current;
}

After this has executed, the $filterName variable would contain the value where, and the $currentSegment variable would contain the value name, =, value). You may have also noticed that once we hit the TOKEN_INPUT_START character we reset the value of $currentSegment. We will do this whenever we know what type of data is contained in the $currentSegment variable (in this instance, it was the filter's name). The next step will be to start parsing the input list.

To parse the input list, we will continue to iterate each of the remaining characters. Once we hit the TOKEN_INPUT_DELIMITER character we will add the value of the $currentSegment to a new list we will create. We will also take this time to update our existing code to add a new $isParsingInput flag we will make use of later:

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

$filterName = '';
$currentSegment = '';
$filterInputs = [];
$isParsingInput = false;

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];

    if ($current === TOKEN_INPUT_DELIMITER) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';

        continue;
    } else if ($current === TOKEN_INPUT_START) {
        $filterName = $currentSegment;
        $currentSegment = '';
        $isParsingInput = true;

        continue;
    }

    $currentSegment .= $current;
}

Once this code has finished executing, the $filterInputs array will contain the following values:

  • name
  • =

This isn't quite right, as we also expect the value input value to be part of that list. It is not currently in the list since our loop has finished iterating the array without taking any additional actions since no conditions were met. We will fix this by adding a new condition that checks for the TOKEN_INPUT_END token:

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

$filterName = '';
$currentSegment = '';
$filterInputs = [];
$isParsingInput = false;

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];

    if ($current === TOKEN_INPUT_END) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';
        $isParsingInput = false;

        continue;
    } else if ($current === TOKEN_INPUT_DELIMITER) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';

        continue;
    } else if ($current === TOKEN_INPUT_START) {
        $filterName = $currentSegment;
        $currentSegment = '';
        $isParsingInput = true;

        continue;
    }

    $currentSegment .= $current;
}

Now once our code executes, the value of $filterName would be where and $filterInputs would contain the following items:

  • name
  • =
  • value

Our next step will be to create the array that will represent a single filter in our input. We know from our input description that an individual filter ends when the TOKEN_INPUT_END token is detected.

When we detect the TOKEN_INPUT_END token we will create a new array containing our filter's name and input values (we will also trim all of the input values at this point to remove extra whitespace):

Trimming the whitespace of the input values is not a technical requirement of the parser itself, and is purely a convenience step to clean up the input values. We could have expanded the list of tokens we are analyzing to include whitespace, and handle it in a better way, but it is not a requirement for this particular use case.

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

$filters = [];
$filterName = '';
$currentSegment = '';
$filterInputs = [];
$isParsingInput = false;

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];

    if ($current === TOKEN_INPUT_END) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';
        $isParsingInput = false;

        $filters[] = [
            'name' => $filterName,
            'input' => array_map('trim', $filterInputs)
        ];

        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_DELIMITER) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';

        continue;
    } else if ($current === TOKEN_INPUT_START) {
        $filterName = $currentSegment;
        $currentSegment = '';
        $isParsingInput = true;

        continue;
    }

    $currentSegment .= $current;
}

Once our code has executed, our $filters variable would contain the following value:

array(1) {
  [0]=>
  array(2) {
    ["name"]=>
    string(5) "where"
    ["input"]=>
    array(3) {
      [0]=>
      string(4) "name"
      [1]=>
      string(1) "="
      [2]=>
      string(5) "value"
    }
  }
}

Parsing Multiple Input Filters

After the previous section, our parser can now extract a single input filter from the input string. Let's update the input.txt with the following value:

where(name, =, value)|is:published(true)

We will add a new condition to our parser that checks if the current token is the TOKEN_FILTER_DELIMITER token; if it is, we will reset our current values and move on without doing anything with the $current character:

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

$filters = [];
$filterName = '';
$currentSegment = '';
$filterInputs = [];
$isParsingInput = false;

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];

    if($current === TOKEN_FILTER_DELIMITER) {
        $currentSegment = '';
        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_END) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';
        $isParsingInput = false;

        $filters[] = [
            'name' => $filterName,
            'input' => array_map('trim', $filterInputs)
        ];

        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_DELIMITER) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';

        continue;
    } else if ($current === TOKEN_INPUT_START) {
        $filterName = $currentSegment;
        $currentSegment = '';
        $isParsingInput = true;

        continue;
    }

    $currentSegment .= $current;
}

This new conditional branch is very similar to the TOKEN_INPUT_END conditional block. We will expand on this conditional block in a later section, but for now we will just consider it a way to reset some working values and advance our iterator to the next character to analyze.

Once the above code has executed our $filters variable would now contain the following values:

array(2) {
  [0]=>
  array(2) {
    ["name"]=>
    string(5) "where"
    ["input"]=>
    array(3) {
      [0]=>
      string(4) "name"
      [1]=>
      string(1) "="
      [2]=>
      string(5) "value"
    }
  }
  [1]=>
  array(2) {
    ["name"]=>
    string(12) "is:published"
    ["input"]=>
    array(1) {
      [0]=>
      string(4) "true"
    }
  }
}

Parsing String Input Values

Our next challenge is going to be allowing for string input values. We will update our input.txt file to contain the following value:

where(name, =, 'Simple string input')

If we run our parser now we get the following output:

array(1) {
  [0]=>
  array(2) {
    ["name"]=>
    string(5) "where"
    ["input"]=>
    array(3) {
      [0]=>
      string(4) "name"
      [1]=>
      string(1) "="
      [2]=>
      string(21) "'Simple string input'"
    }
  }
}

It almost looks correct the way it is; the problem is that the final input value includes the TOKEN_STR_DELIMITER characters at the start and end. These characters are not part of the actual input, and are the way we start and end strings with our simple language. We will need to get rid of these in the output.

We will add a new conditional branch that checks for the TOKEN_STR_DELIMITER, as well as a new flag $isParsingString that will change depending on if we've already encountered the TOKEN_STR_DELIMITER character.

If we encounter the TOKEN_STR_DELIMITER character and the $isParsingString flag is false, we know that we are entering into a new string value; the opposite is true: if we encounter the TOKEN_STR_DELIMITER character and $isParsingString is true we can assume that we have just finished parsing a string input value.

Like with the TOKEN_FILTER_DELIMITER conditional branch we will reset the $currentSegment value and advance our iterator (we will also not add the TOKEN_STR_DELIMITER to the $currentSegment working value). We will also toggle the value of $isParsingString within this block:

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

$filters = [];
$filterName = '';
$currentSegment = '';
$filterInputs = [];
$isParsingInput = false;
$isParsingString = false;

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];

    if ($current === TOKEN_STR_DELIMITER) {
        $isParsingString = !$isParsingString;
        $currentSegment = '';

        continue;
    } else if($current === TOKEN_FILTER_DELIMITER) {
        $currentSegment = '';
        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_END) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';
        $isParsingInput = false;

        $filters[] = [
            'name' => $filterName,
            'input' => array_map('trim', $filterInputs)
        ];

        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_DELIMITER) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';

        continue;
    } else if ($current === TOKEN_INPUT_START) {
        $filterName = $currentSegment;
        $currentSegment = '';
        $isParsingInput = true;

        continue;
    }

    $currentSegment .= $current;
}

If we run this code now, we will get the following output:

array(1) {
  [0]=>
  array(2) {
    ["name"]=>
    string(5) "where"
    ["input"]=>
    array(3) {
      [0]=>
      string(4) "name"
      [1]=>
      string(1) "="
      [2]=>
      string(0) ""
    }
  }
}

Our last input value has been completely removed now. This is interesting, but has a pretty simple cause and solution. When our TOKEN_STR_DELIMITER token is encountered it resets the value of $currentSegment. We need to update this code block to check if we are already parsing a string. If so, we will add the current value of $currentSegment to our $filterInputs before resetting things and moving on:

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

$filters = [];
$filterName = '';
$currentSegment = '';
$filterInputs = [];
$isParsingInput = false;
$isParsingString = false;

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];

    if ($current === TOKEN_STR_DELIMITER) {
        if ($isParsingString) {
            $filterInputs[] = $currentSegment;
        }

        $isParsingString = !$isParsingString;
        $currentSegment = '';

        continue;
    } else if($current === TOKEN_FILTER_DELIMITER) {
        $currentSegment = '';
        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_END) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';
        $isParsingInput = false;

        $filters[] = [
            'name' => $filterName,
            'input' => array_map('trim', $filterInputs)
        ];

        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_DELIMITER) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';

        continue;
    } else if ($current === TOKEN_INPUT_START) {
        $filterName = $currentSegment;
        $currentSegment = '';
        $isParsingInput = true;

        continue;
    }

    $currentSegment .= $current;
}

Our parser output is now:

array(1) {
  [0]=>
  array(2) {
    ["name"]=>
    string(5) "where"
    ["input"]=>
    array(4) {
      [0]=>
      string(4) "name"
      [1]=>
      string(1) "="
      [2]=>
      string(19) "Simple string input"
      [3]=>
      string(0) ""
    }
  }
}

We now have a new problem in that we have an empty string value as a fourth input value that shouldn't be there. We cannot simply check if we have empty strings and remove them, since an empty string value can be completely valid input, and we have no way of making a distinction between valid and invalid empty strings. Why is this happening, though?

Currently our TOKEN_STR_DELIMITER code block is adding the $currentSegment value to our $filterInputs array whenever it is reached and we are already parsing a string value. If we take another look at our input string we can see that the TOKEN_INPUT_END character follows our TOKEN_STR_DELIMITER character:

where(name, =, 'Simple string input')

If we think about this some more, in an earlier section we implemented the TOKEN_INPUT_END conditional block, which also adds the $currentSegment value to our $filterInputs array. This behavior would also be present if another string input value was included in the list after a string because the TOKEN_INPUT_DELIMITER token would be present:

where(name, =, 'Simple string input', value)

Because we know that only the TOKEN_INPUT_DELIMITER and TOKEN_INPUT_END characters should follow the end of a string delimiter, we can update our TOKEN_STR_DELIMITER block to simply indicate that we are no longer parsing a string and move on (since one of the following conditional blocks would take care of adding the value to the $filterInputs array):

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

$filters = [];
$filterName = '';
$currentSegment = '';
$filterInputs = [];
$isParsingInput = false;
$isParsingString = false;

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];

    if ($current === TOKEN_STR_DELIMITER) {
        if ($isParsingString) {
            $isParsingString = false;

            continue;
        }

        $isParsingString = true;
        $currentSegment = '';

        continue;
    } else if($current === TOKEN_FILTER_DELIMITER) {
        $currentSegment = '';
        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_END) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';
        $isParsingInput = false;

        $filters[] = [
            'name' => $filterName,
            'input' => array_map('trim', $filterInputs)
        ];

        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_DELIMITER) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';

        continue;
    } else if ($current === TOKEN_INPUT_START) {
        $filterName = $currentSegment;
        $currentSegment = '';
        $isParsingInput = true;

        continue;
    }

    $currentSegment .= $current;
}

The output of our parser is now what we would expect:

array(1) {
  [0]=>
  array(2) {
    ["name"]=>
    string(5) "where"
    ["input"]=>
    array(3) {
      [0]=>
      string(4) "name"
      [1]=>
      string(1) "="
      [2]=>
      string(19) "Simple string input"
    }
  }
}

Implementing String Escape Sequences

In this section we will work on implementing our string escape sequences. We will start by adjusting our input.txt to have the following content:

where(name, =, 'Singe quote: \'')

The intended output for the final input value is Single quote: '. In an earlier section we decided that our string escape character would be a backslash character (\). When the escape character is encountered we need to check the character that follows, and if it is either a single quote (') or another backslash (\), we will output those characters in the output and move on (we could also do this by detecting a backslash or single quote character and checking what the previous character was).

To do this, we will implement a simple mechanism that provides us access to the next character, or the "peek" character:

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];
    $next = null;

    if (($i + 1) < $inputLength) {
        $next = $tokens[$i + 1];
    }

    // ...
}

Now that we have access to our peek character, we can add a new conditional branch to detect the TOKEN_STR_ESCAPE character. We will create a list of all valid characters that can follow the string escape sequence and check against that, as well as check if we are currently parsing a string:

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

$filters = [];
$filterName = '';
$currentSegment = '';
$filterInputs = [];
$isParsingInput = false;
$isParsingString = false;

$validEscapeCharacters = ['\'', '\\'];

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];
    $next = null;

    if (($i + 1) < $inputLength) {
        $next = $tokens[$i + 1];
    }

    if ($current === TOKEN_STR_ESCAPE) {
        if ($isParsingString && $next !== null) {
            if (in_array($next, $validEscapeCharacters)) {
                $currentSegment .= $next;
                $i++;
                continue;
            }
        }
    } else if ($current === TOKEN_STR_DELIMITER) {
        if ($isParsingString) {
            $isParsingString = false;

            continue;
        }

        $isParsingString = true;
        $currentSegment = '';

        continue;
    } else if($current === TOKEN_FILTER_DELIMITER) {
        $currentSegment = '';
        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_END) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';
        $isParsingInput = false;

        $filters[] = [
            'name' => $filterName,
            'input' => array_map('trim', $filterInputs)
        ];

        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_DELIMITER) {
        $filterInputs[] = $currentSegment;
        $currentSegment = '';

        continue;
    } else if ($current === TOKEN_INPUT_START) {
        $filterName = $currentSegment;
        $currentSegment = '';
        $isParsingInput = true;

        continue;
    }

    $currentSegment .= $current;
}

In our implementation of the TOKEN_STR_ESCAPE condition branch, you may have noticed that we are adding the $next character to the $currentSegment instead of the escape character. This is because the escape character itself is not interesting, only what comes after if followed by a valid character. Additionally, we have added the $i++ to that block before continuing to advance our position by one in the list of input tokens.

If we did not add the $i++ to our TOKEN_STR_ESCAPE block we would see our desired characters doubled in the output since the next iteration would re-process that character. Running the parser now produces the following, desired, output:

array(1) {
  [0]=>
  array(2) {
    ["name"]=>
    string(5) "where"
    ["input"]=>
    array(3) {
      [0]=>
      string(4) "name"
      [1]=>
      string(1) "="
      [2]=>
      string(14) "Singe quote: '"
    }
  }
}

Improving the String Implementation

At the moment we've constructed a simple parser for our filter input language, and works relatively well. However, there are a few improvements we need to make regarding the string implementation. To get started, let's update the value of our input.txt file to contain the following:

where(name, =, 'Singe (, | quote: \'')

Running our parser now produces the following output:

array(1) {
  [0]=>
  array(2) {
    ["name"]=>
    string(0) ""
    ["input"]=>
    array(1) {
      [0]=>
      string(8) "quote: '"
    }
  }
}

The output is not what we would expect; we've lost our filter name as well as all meaningful input values. The reason for this is that we've added the (, ,, and | characters to our string input.

Previously, before we implemented our string parser, we had created conditional blocks that check for those characters since they have special meaning in our filter input language. To solve our current issue we simply have to adjust those conditional blocks and add a check to see if we are currently parsing a string. If we are parsing a string, we will add that character to the output; if we are not parsing a string, we will continue with the previous behavior:

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

$filters = [];
$filterName = '';
$currentSegment = '';
$filterInputs = [];
$isParsingInput = false;
$isParsingString = false;

$validEscapeCharacters = ['\'', '\\'];

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];
    $next = null;

    if (($i + 1) < $inputLength) {
        $next = $tokens[$i + 1];
    }

    if ($current === TOKEN_STR_ESCAPE) {
        if ($isParsingString && $next !== null) {
            if (in_array($next, $validEscapeCharacters)) {
                $currentSegment .= $next;
                $i++;
                continue;
            }
        }
    } else if ($current === TOKEN_STR_DELIMITER) {
        if ($isParsingString) {
            $isParsingString = false;

            continue;
        }

        $isParsingString = true;
        $currentSegment = '';

        continue;
    } else if($current === TOKEN_FILTER_DELIMITER) {
        if ($isParsingString) {
            $currentSegment .= $current;
            continue;
        }

        $currentSegment = '';
        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_END) {
        if ($isParsingString) {
            $currentSegment .= $current;
            continue;
        }

        $filterInputs[] = $currentSegment;
        $currentSegment = '';
        $isParsingInput = false;

        $filters[] = [
            'name' => $filterName,
            'input' => array_map('trim', $filterInputs)
        ];

        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_DELIMITER) {
        if ($isParsingString) {
            $currentSegment .= $current;
            continue;
        }

        $filterInputs[] = $currentSegment;
        $currentSegment = '';

        continue;
    } else if ($current === TOKEN_INPUT_START) {
        if ($isParsingString) {
            $currentSegment .= $current;
            continue;
        }
        
        $filterName = $currentSegment;
        $currentSegment = '';
        $isParsingInput = true;

        continue;
    }

    $currentSegment .= $current;
}

After this change our parser now produces the expected output:

array(1) {
  [0]=>
  array(2) {
    ["name"]=>
    string(5) "where"
    ["input"]=>
    array(3) {
      [0]=>
      string(4) "name"
      [1]=>
      string(1) "="
      [2]=>
      string(19) "Singe (, | quote: '"
    }
  }
}

Adding Error Messages to Our Parser

Our parser can now understand our simple filter language, and produces the output that we would expect. However, we are able to say it produces the output we expect because we know exactly what the input to the program should be; new users of our filter language are unlikely to be as familiar with it as we are, and are likely to become frustrated when the parser produces inconsistent or unexpected output. To help with this, we will now add some error checking and error messages to our parser.

To start, we will go through and list possible error scenarios. Once we have constructed the list, we will implement the corresponding checks in the code and throw exceptions when an error condition is met. We will not be doing anything with the output of our parser, so all of our error conditions will be a result of encountering unexpected input.

The types of errors we will be reporting to users will generally fall into one of two categories:

  • Unexpected character while parsing,
  • Unexpected end of input

The following are examples of possible error scenarios (it is not an exhaustive list):

  • The TOKEN_INPUT_START is encountered, but we never see a TOKEN_INPUT_END,
  • We encounter a TOKEN_INPUT_END, and have not seen TOKEN_INPUT_START,
  • We encounter a TOKEN_INPUT_DELIMITER, but we are not currently parsing an input list,
  • We encounter a TOKEN_FILTER_DELIMITER, but the previous token was not TOKEN_INPUT_END
  • A string is started, but is never finished,
  • An invalid string escape sequence is detected

Some error conditions will be easier to detect if we also have access to the previous character, in addition to the next character, of the input. These types of error conditions include scenarios where the input string begins with an illegal character (any of our delimiter tokens), or when an unexpected character follows a special token (string escape sequences could have also been implemented by checking with the previous character, for example). The following example shows how one might implement some basic error messages for this simple type of parser:

<?php

const TOKEN_FILTER_DELIMITER = '|';
const TOKEN_INPUT_START = '(';
const TOKEN_INPUT_END = ')';
const TOKEN_INPUT_DELIMITER = ',';
const TOKEN_STR_DELIMITER = '\'';
const TOKEN_STR_ESCAPE = '\\';

$tokens = mb_str_split(file_get_contents('input.txt'));
$inputLength = count($tokens);

$filters = [];
$filterName = '';
$currentSegment = '';
$filterInputs = [];
$isParsingInput = false;
$isParsingString = false;

$validEscapeCharacters = ['\'', '\\'];
$stringStartPosition = -1;
$inputListStartPosition = -1;

for ($i = 0; $i < $inputLength; $i++) {
    $current = $tokens[$i];
    $next = null;
    $previous = null;

    if ($i > 0) {
        $previous = $tokens[$i - 1];
    }

    if (($i + 1) < $inputLength) {
        $next = $tokens[$i + 1];
    }

    if ($current === TOKEN_FILTER_DELIMITER) {
        if ($previous === null) {
            throw new Exception('Cannot start expression with TOKEN_FILTER_DELIMITER');
        } else if ($previous !== TOKEN_INPUT_END && !$isParsingString) {
            throw new Exception('Unexpected TOKEN_FILTER_DELIMITER at character: ' . ($i + 1));
        }
    } else if ($current === TOKEN_INPUT_END && $previous === null) {
        throw new Exception('Cannot start expression with TOKEN_INPUT_END');
    } else if ($current === TOKEN_INPUT_START && $previous === null) {
        throw new Exception('Cannot start expression with TOKEN_INPUT_START');
    } else if ($current === TOKEN_STR_DELIMITER && $previous === null) {
        throw new Exception('Cannot start expression with TOKEN_STR_DELIMITER');
    } else if ($current === TOKEN_INPUT_DELIMITER && $previous === null) {
        throw new Exception('Cannot start expression with TOKEN_INPUT_DELIMITER');
    }

    if ($current === TOKEN_INPUT_START && !$isParsingString && $isParsingInput) {
        throw new Exception('Unexpected TOKEN_INPUT_START at character: ' . ($i + 1));
    } else if ($current === TOKEN_INPUT_END && !$isParsingInput) {
        throw new Exception('Unexpected TOKEN_INPUT_END at character: '.($i + 1));
    }

    if ($next === null) {
        if ($current !== TOKEN_INPUT_END && $previous !== TOKEN_INPUT_END) {
            throw new Exception('Reached end of input string without input list.');
        }

        if ($isParsingInput && $current !== TOKEN_INPUT_END) {
            throw new Exception('Unexpected end of input while parsing input list. ' .
                'Input list started at character: ' . $inputListStartPosition);
        } else if ($isParsingString) {
            throw new Exception('Unexpected end of input while parsing string. ' .
                'String started at character: ' . $stringStartPosition);
        }
    }

    if ($current === TOKEN_STR_ESCAPE) {
        if ($isParsingString && $next !== null) {
            if (in_array($next, $validEscapeCharacters)) {
                // Handles case of: where(property, =, '\'
                if (($i + 2) >= $inputLength) {
                    throw new Exception('Unexpected end of input while parsing string. ' .
                        'String started at character: ' . $stringStartPosition);
                }

                $currentSegment .= $next;
                $i++;
                continue;
            } else {
                throw new Exception('Invalid string escape sequence at '
                    . ($i + 1) . ' ("\\' . $next . '").');
            }
        }
    } else if ($current === TOKEN_STR_DELIMITER) {
        if ($isParsingString) {
            $isParsingString = false;
            $stringStartPosition = -1;

            continue;
        }

        $isParsingString = true;
        $stringStartPosition = $i + 1;
        $currentSegment = '';

        continue;
    } else if ($current === TOKEN_FILTER_DELIMITER) {
        if ($isParsingString) {
            $currentSegment .= $current;
            continue;
        }

        if ($next === null) {
            throw new Exception('Unexpected end of input. Expecting new filter.');
        }

        $currentSegment = '';
        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_END) {
        if ($isParsingString) {
            $currentSegment .= $current;
            continue;
        }

        $filterInputs[] = $currentSegment;
        $currentSegment = '';
        $isParsingInput = false;
        $inputListStartPosition = -1;

        $filters[] = [
            'name' => $filterName,
            'input' => array_map('trim', $filterInputs)
        ];

        $filterName = '';
        $filterInputs = [];

        continue;
    } else if ($current === TOKEN_INPUT_DELIMITER) {
        if ($isParsingString) {
            $currentSegment .= $current;
            continue;
        }

        if ($isParsingInput === false) {
            throw new Exception('Unexpected TOKEN_INPUT_DELIMITER ' .
                'outside of input list at character: ' . ($i + 1));
        }

        $filterInputs[] = $currentSegment;
        $currentSegment = '';

        continue;
    } else if ($current === TOKEN_INPUT_START) {
        if ($isParsingString) {
            $currentSegment .= $current;
            continue;
        }

        $filterName = $currentSegment;
        $currentSegment = '';
        $isParsingInput = true;
        $inputListStartPosition = $i + 1;

        continue;
    }

    $currentSegment .= $current;
}

Wrapping Up

We started this article with a simple description of a customer lan gauge to help describe filters. Throughout the remainder of the article, we built on basic concepts to not only parse our custom syntax, but also handle some intermediate concepts such as string parsing, including escape sequences. The parser we implemented is a fairly basic parser, but can be expanded on to account for much more interesting scenarios, most of which are well beyond the scope of this article, but may come as follow-ups at a later date.

If you found this article interesting, or helpful, consider sharing it with your friends and colleagues!

Comments

There are no comments. Be the first to comment!

Up next