My focus this week with Forte has been continuing to work on the HTML tokenizer and parser. I've written enough Blade parsers over the years that the "Blade" parts are largely a solved problem, so it won't be much work for me to shore up that side of things. Best to focus on the unknowns and bring it all home later.
HTML, though, is its own special kind of fun, so allow me to ramble a bit in this post while I work through some things.
#What I've Got So Far
At the moment, the basic things you'd expect from an HTML parser are working: elements, attributes, void elements, self-closing tags, inner children, all that fun stuff. Even some extra fiddly bits like CDATA and XML declarations (like <!ATTRIBUTE, <!ENTITY, etc.). All of that is relatively simple in isolation. However, things start to go off the rails fast once you want to parse inline Blade, as well as contextualize what you've parsed.
Let's take a look at this simple example:
<div class="mt-4">
    <p>Hello, {{ $name }}!</p>
    <p>This is some <span>nested</span> content.</p>
</div>
and the AST produced:
Well, that's not terrible! And neither is this:
<div class="mt-4" {{ $attributes }}>
    <p>Hello, world.</p>
</div>
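As a rough illustration of the "inline Blade" part of the first example, here is a minimal Python sketch that splits a run of text into literal pieces and `{{ ... }}` echoes. It is not taken from Forte; the `ECHO` pattern and the `split_echoes` helper are hypothetical names, and the regex is deliberately naive (it assumes the echoed expression never contains a literal `}}`).

```python
import re

# Hypothetical pattern: a Blade echo is "{{", an expression, then "}}".
# Assumption: the expression itself never contains "}}".
ECHO = re.compile(r"\{\{\s*(.*?)\s*\}\}", re.DOTALL)


def split_echoes(source):
    """Split text into ("text", literal) and ("echo", expression) parts."""
    parts = []
    pos = 0
    for match in ECHO.finditer(source):
        # Keep any literal text that sits before this echo.
        if match.start() > pos:
            parts.append(("text", source[pos:match.start()]))
        parts.append(("echo", match.group(1)))
        pos = match.end()
    # Keep whatever literal text remains after the last echo.
    if pos < len(source):
        parts.append(("text", source[pos:]))
    return parts
```

Calling `split_echoes("Hello, {{ $name }}!")` yields a text part, an echo part holding `$name`, and a final text part for the exclamation mark. A real tokenizer would also need to track source offsets and handle strings inside the expression.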
Things get a little more complicated with input like this, but it's not the end of the world to figure out how to tokenize and parse it:
<div
    class="mt-4"
    @if ($disabled) aria-disabled="true" @endif
>
    @if ($disabled) Some disabled content. @endif
</div>
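To make the attribute case a bit more concrete, here is a small Python sketch of scanning the inside of an opening tag into attribute, directive, and echo tokens. This is only an assumption about how such a scan could look, not Forte's implementation: `TOKEN` and `tokenize_attributes` are made-up names, directive arguments are only matched to one level of nested parentheses, and anything starting with `@` is treated as a directive (so Alpine-style `@click` attributes would be misread).

```python
import re

# Hypothetical token grammar for the attribute region of an opening tag.
TOKEN = re.compile(r"""
    (?P<echo>\{\{.*?\}\})
  | (?P<directive>@[A-Za-z]+(?:\s*\((?:[^()]|\([^()]*\))*\))?)
  | (?P<attr>[^\s=@{}"'<>/]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?)
  | (?P<ws>\s+)
""", re.VERBOSE | re.DOTALL)


def tokenize_attributes(source):
    """Return (kind, raw_text) tokens, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected {source[pos]!r} at offset {pos}")
        # lastgroup is the name of the alternative that matched.
        if match.lastgroup != "ws":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Fed the attribute region of the example above, this produces an `attr` token for `class="mt-4"`, a `directive` token for `@if ($disabled)`, another `attr` token, and a closing `directive` token for `@endif`. Deciding what those directives mean for the element is the parser's job, not the tokenizer's.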
#Blade Components vs. Flux Components vs. Livewire Components vs. 🥷
What is the difference between them? This is a topic I've put an almost unreasonable amount of thought into. We have HTML elements, Blade, Flux, Livewire, and Dagger components, among others. Syntactically, there is not a vast difference between these; they all resemble HTML tags.
<div
    class="mt-4"
    {{ $attributes }}
>
    <p>Some content.</p>
</div>

<livewire:post-item :$post :key="$post->id" />

<x-alert title="The alert title." />

<flux:badge color="lime">New</flux:badge>
Sure, Blade components and their friends have a few extra syntactical goodies, but not that many. In previous projects, I've created special Component nodes to represent Blade components. However, this approach starts to feel clunky when you want to parse custom components, and adding a ton of extra node types also doesn't feel correct.
With Forte, I've decided that they will all just be elements. To do this, I've relaxed the HTML parser, and it will now parse Blade attributes just fine in HTML elements:
<div :$title>
    <p>Some content here.</p>
</div>
At first, I wasn't sure if I liked this idea, but the more I've thought about it, the more I like it: it opens up several doors to implement some really awesome pre-compilers in the future. Not to mention the consistency in the AST:
Less baked-in opinions and more flexibility.
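The "everything is an element" idea can be sketched roughly like this in Python. None of it comes from Forte: the `Element` class and the `component_kind` helper are hypothetical, and the prefix rules (`x-` for Blade, a `prefix:` namespace for everything else) are assumptions drawn from the examples above. The point is that the kind of component is derived from the name after parsing instead of being baked into the node type.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """A single node type for HTML tags and component tags alike."""
    name: str
    attributes: list = field(default_factory=list)
    children: list = field(default_factory=list)


def component_kind(element):
    """Classify an element by its name (hypothetical convention)."""
    name = element.name
    if name.startswith("x-"):
        return "blade"
    # Names such as "livewire:post-item" or "flux:badge" carry a prefix.
    prefix, separator, _ = name.partition(":")
    if separator:
        return prefix
    return "html"
```

With this shape, a pre-compiler for a new component family only needs to look for its own prefix; the parser and the node types stay untouched.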
#What Even is an Element Name
Buckle up. This is where things get interesting, and it's the main reason for some long stretches of thinking without coding. That may be surprising, since the answer should be simple: it's the identifier of the tag!
It is indeed simple when it's a regular HTML tag like so:
<div>...</div>
Slightly less simple if we have something like this:
<{{ $element }} class="mt-4">
    ...
</{{ $element }}>
Here we're going down dark paths:
<@if ($condition)div@endif class="mt-4">
    ...
</@if ($condition)div@endif>
And this. This is just deranged:
<@foreach (['d', 'i', 'v'] as $char){{ $char }}@endforeach class="mt-4">
    <p>Inner Content</p>
</@foreach (['d', 'i', 'v'] as $char){{ $char }}@endforeach>

<div>More Elements</div>
However, I've been working hard to keep things as generic as possible, and the Forte tokenizer and parser can now handle even that:
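One way to picture a "generic" element name is as a list of parts instead of a single string. The Python sketch below is a guess at the general shape, not Forte's code: `PART` and `read_tag_name` are hypothetical, directive arguments are only matched to one level of nested parentheses, and the name simply ends at the first character that is not an echo, a directive, or name text (so whitespace between parts would cut the name short).

```python
import re

# Hypothetical grammar for the pieces that may make up a tag name.
PART = re.compile(r"""
    (?P<echo>\{\{.*?\}\})
  | (?P<directive>@[A-Za-z]+(?:\s*\((?:[^()]|\([^()]*\))*\))?)
  | (?P<text>[A-Za-z0-9:._-]+)
""", re.VERBOSE | re.DOTALL)


def read_tag_name(source, pos):
    """Read name parts starting at pos; return (parts, end_offset)."""
    parts = []
    while True:
        match = PART.match(source, pos)
        if match is None:
            break  # whitespace, ">" or "/" ends the name
        parts.append((match.lastgroup, match.group()))
        pos = match.end()
    return parts, pos
```

Under this model, `<div>` has the single part `("text", "div")`, while the `@foreach` example has three parts: a directive, an echo, and a closing directive. An opening and closing tag can then be paired by comparing their part lists, without ever evaluating the Blade inside them.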
#Wrapping Up
I still have a lot of work to do, but things are progressing nicely. A few things next up on my list:
- Continue shoring up the HTML parser
- Handle a few edge cases with element children
- Add extensive test coverage for the HTML parser
- Parser fuzzing
I'll try to do blog posts for most of these things, but I am not promising a post for every working session.
∎