Things are moving fast! Below you will find various updates that didn't make it to their own blog post.
#Backtracking Support
Forte's parser now supports backtracking. This allows the parser to recover from a wide variety of scenarios, ensuring that the produced AST is as helpful as it can possibly be.
As an example, let's consider the following template:
<!doctype html>
<html>
<body>
    @can ('update', $post)
</body>
</html>
The parser understands that the @can directive should be paired with something, and it will attempt to find its closing directive. Without backtracking, since no closing directive exists, the parser would consume all the tokens until it reaches the end of the document.
At first glance, this might seem okay, but since there are no tokens left, the HTML tags can no longer be paired correctly. To resolve this issue, the parser instead first attempts to parse until it finds the correct stopping point. If it doesn't find one, it rewinds, emits a normal directive node, and continues parsing within the original context.
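To make the try-then-rewind idea concrete, here is a minimal Python sketch. It is not Forte's actual implementation: the flat token list, the `PAIRS` table, and the `Node` class are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical table of paired directives (an assumption, not Forte's real table).
PAIRS = {"@can": "@endcan", "@if": "@endif"}


@dataclass
class Node:
    kind: str  # "text", "directive", or "block"
    value: str
    children: list = field(default_factory=list)


def parse(tokens, pos=0, stop=None):
    """Parse tokens from pos until `stop` is seen or the input ends.

    Returns (nodes, next_pos, found_stop).
    """
    nodes = []
    while pos < len(tokens):
        tok = tokens[pos]
        if stop is not None and tok == stop:
            return nodes, pos + 1, True
        if tok in PAIRS:
            # Speculatively parse the body, looking for the closing directive.
            children, after, closed = parse(tokens, pos + 1, PAIRS[tok])
            if closed:
                nodes.append(Node("block", tok, children))
                pos = after
            else:
                # Backtrack: discard the speculative result, emit the opener
                # as a plain directive, and resume right after it.
                nodes.append(Node("directive", tok))
                pos += 1
            continue
        kind = "directive" if tok.startswith("@") else "text"
        nodes.append(Node(kind, tok))
        pos += 1
    return nodes, pos, False


unclosed, _, _ = parse(["<body>", "@can", "</body>"])
closed, _, _ = parse(["<body>", "@can", "Edit", "@endcan", "</body>"])
```

In the unclosed case, `@can` ends up as a standalone directive node and `</body>` stays a sibling in the original context. In the closed case, `@can` becomes a block that owns `Edit`. A real parser would also want to bound or cache the speculative work, since this naive version can re-scan the same tokens many times.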
This is the AST without backtracking:
[Image: AST without backtracking]
This is the AST produced with backtracking:
[Image: AST with backtracking]
That's a lot better!
#Metadata Updates
Emitted nodes contain metadata; of particular interest for this post are the offset positions and the document content within the source template.
The work on this project began with accurately tracking token offset positions: where the tokens, and eventually the nodes, originate within the template. Currently, these are just character offsets, but a future iteration will also include line and column numbers (this information would be particularly useful for building an incremental compiler and generating accurate source maps).
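For a sense of how character offsets relate to line and column numbers, here is a small Python sketch of one common conversion: record where each line starts, then binary-search that list. The function names are hypothetical, and this is only one possible approach, not something Forte is confirmed to do.

```python
from bisect import bisect_right


def line_starts(source):
    """Return the character offset at which each line begins."""
    starts = [0]
    for i, ch in enumerate(source):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def to_line_col(starts, offset):
    """Convert a 0-based character offset to a 1-based (line, column) pair."""
    line = bisect_right(starts, offset) - 1
    return line + 1, offset - starts[line] + 1


source = "<div>\n  @can\n</div>"
starts = line_starts(source)
position = to_line_col(starts, source.index("@can"))
```

Here `position` is `(2, 3)`: the directive sits on the second line, in the third column. The line-start table is built once per template, so each lookup afterwards is a cheap binary search.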
The calculated offsets are then used to retrieve the source content for each node within the template. All of this information is useful in various situations. One recent internal use of the offset information is detecting "runs" of nodes when parsing attributes.
An attribute node run is indicated by there being no whitespace between nodes:
<div
    {{ $dynamic }}-parts-{{ $for }}-everyone="the-value"
>

</div>
When this is detected, the parser will roll all of these up into a single AST node, making it much simpler for the AST consumer.
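Run detection by offsets can be sketched in a few lines of Python. The `Part` class, the `locate` helper, and the node kinds below are assumptions for illustration only. The idea is simply that two parts belong to the same run when one ends exactly where the next begins.

```python
from dataclasses import dataclass


@dataclass
class Part:
    kind: str   # hypothetical node kind, e.g. "echo" or "text"
    start: int  # offset of the first character
    end: int    # offset just past the last character


def locate(source, pieces):
    """Build Parts by finding each (kind, text) piece, in order, within source."""
    parts, cursor = [], 0
    for kind, text in pieces:
        start = source.index(text, cursor)
        cursor = start + len(text)
        parts.append(Part(kind, start, cursor))
    return parts


def group_runs(parts):
    """Group consecutive parts whose offsets touch (nothing between them)."""
    runs = []
    for part in parts:
        if runs and runs[-1][-1].end == part.start:
            runs[-1].append(part)
        else:
            runs.append([part])
    return runs


source = '{{ $dynamic }}-parts-{{ $for }}-everyone="the-value" disabled'
parts = locate(source, [
    ("echo", "{{ $dynamic }}"),
    ("text", "-parts-"),
    ("echo", "{{ $for }}"),
    ("text", '-everyone="the-value"'),
    ("text", "disabled"),
])
runs = group_runs(parts)
merged = [source[run[0].start:run[-1].end] for run in runs]
```

The first four parts touch each other and collapse into one run, while `disabled` is separated by a space and starts a new one. Because each run keeps its first and last offsets, the original source text for the merged node can be sliced straight out of the template.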
Then, there is the document content. This can be useful internally to help make informed decisions, especially when pairing more complex dynamic HTML tags. Other benefits include improving the test suite and the ability to reconstruct the original document exactly from the AST.
#Updating Identifiers
Each node Forte emits has a unique identifier. While prototyping, this has just been set to a GUID. These identifiers help me debug and track down issues when visualizing the AST, as well as ensure that each node is actually unique.
The GUID technique works, but I have since refactored this to a predictable identifier, built up based on the depth of the current node. The main reason for this change is to make the identifiers predictable across parser runs.
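The post doesn't show the exact identifier scheme, so the following Python sketch is only one plausible way to get identifiers that stay the same across runs: derive each one from the node's position in the tree. The dict-based node shape and the dotted path format are assumptions.

```python
def assign_ids(node, path=()):
    """Give every node an id derived from its position in the tree."""
    node["id"] = ".".join(str(index) for index in path) or "root"
    for index, child in enumerate(node.get("children", [])):
        assign_ids(child, path + (index,))
    return node


def make_tree():
    # A tiny stand-in for a parsed template (hypothetical node shape).
    return {
        "type": "root",
        "children": [
            {"type": "element", "children": [{"type": "text"}]},
            {"type": "directive"},
        ],
    }


first = assign_ids(make_tree())
second = assign_ids(make_tree())
```

Both runs produce the ids `root`, `0`, `0.0`, and `1`, so `first == second`. With random GUIDs, the same comparison would fail on every run.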
After considerable effort, the core tokenizer and parser are largely complete, and I need to move on to handling more edge cases and implementing fault tolerance.
However, before diving into that, I want to add a ton of test cases to ensure the parser continues to work as expected for happy-path templates. This task is begging for snapshot testing; having random identifiers would make this quite challenging.
#HTML Parser Validation
I've also started validating and improving the HTML parser. To start, I've downloaded close to 100 MB of random HTML files to see what happens. I could create my own test files, but I prefer to gather random content from the Internet as inspiration, as it will uncover things I definitely did not expect.
And it worked! The parser failed on the first random file, but I've resolved the issues, and the parser now processes all the samples I've collected.
After the updates, the parser can now handle unfriendly input like this:
<!doctype html>

<script src=/some/src.js></script>
<script src=/some/other/src.js></script
<script src="another/src.js"></script>

<script>
func(`<script> (async () => {` + `</scr` + `ipt>'");``);
</script>
And the AST:
[Image: AST for the template above]
Did you notice how the parser repaired that second script tag? That was fun.
It will take some time to verify correctness and update the test suite, but things are progressing nicely.
That's all for this update!