WEBVTT

00:00.000 --> 00:09.000
The next thought would be about stronger interrupts from H.T.M.N.N.N.

00:09.000 --> 00:12.000
And that is between ideas.

00:12.000 --> 00:14.000
Thank you very much.

00:14.000 --> 00:19.000
All right.

00:19.000 --> 00:25.000
I am a sponsored contributor to WordPress, which is why I'm giving this talk.

00:25.000 --> 00:29.000
My name is Dennis Snell, and I'm very passionate about this topic.

00:29.000 --> 00:33.000
I would love to talk with you afterwards in the hall, or later over a beer.

00:33.000 --> 00:38.000
I'm going to run really fast, because I couldn't cut enough out of this talk.

00:38.000 --> 00:40.000
A quick note at the beginning.

00:40.000 --> 00:45.000
There's going to be a lot of PHP code in the slides, but this talk is really not about PHP

00:45.000 --> 00:47.000
nor is it about WordPress.

00:47.000 --> 00:51.000
The ideas transcend any particular programming language.

00:51.000 --> 00:55.000
And I hope you had a chance to read this quote at the beginning, because it's framing the talk.

00:55.000 --> 01:03.000
About a year and a half ago, I spent about three months deep diving into HTML, XML, and SGML history,

01:03.000 --> 01:09.000
poring over the mailing lists, reading the HTML specification, and surprising patterns stuck out

01:09.000 --> 01:17.000
since the beginning of interconnected computers and markup languages that allowed us to add structure to our human communication.

01:17.000 --> 01:24.000
We have been encouraging one another to use substandard tools to create corrupted documents.

01:25.000 --> 01:34.000
In terms of HTML, we ended up building a fully HTML-spec-compliant parser in user-space PHP

01:34.000 --> 01:39.000
code, and that's been shipping in WordPress for over two years.

01:39.000 --> 01:44.000
I like to motivate this talk by asking, how did we get to this point?

01:44.000 --> 01:49.000
And everything on this slide represents what I like to call legitimate HTML.

01:49.000 --> 01:53.000
It is true, these are all malformed snippets of HTML that have different errors.

01:53.000 --> 01:59.000
But in HTML, errors are benign in the sense that there is no actual erroneous HTML.

01:59.000 --> 02:05.000
A common misunderstanding is that HTML is loose and forgiving, but it's more specified than XML.

02:05.000 --> 02:09.000
I usually like to challenge the audience to tell me what's going on in the highlighted snippet.

02:09.000 --> 02:13.000
But for the sake of time: we've got a tag with four attributes,

02:13.000 --> 02:21.000
three of whose values are the empty string, and one of whose values is less than three.

02:21.000 --> 02:26.000
This work all started on the way to or from a WordPress conference.

02:26.000 --> 02:31.000
There had been a few bugs in a row, three particular bugs in a row that brought crashes to WordPress.

02:31.000 --> 02:38.000
And they all stemmed from the same silly mistakes of assuming that whitespace was present or that whitespace wasn't present

02:38.000 --> 02:41.000
or that attributes used double quotes instead of single quotes.
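
NOTE
Editor's sketch, not from the talk's slides: a minimal Python illustration of the bug class described here, assuming the standard-library html.parser as a stand-in tokenizer (it is not the WordPress HTML API, and it is not fully spec-compliant). The names naive_src, parsed_src, and SAMPLES are hypothetical.
```python
import re
from html.parser import HTMLParser
from typing import Optional
# Naive pattern: assumes one space, a lowercase name, and double quotes.
NAIVE_SRC = re.compile(r'<img src="([^"]*)"')
def naive_src(tag: str) -> Optional[str]:
    """Return the src value only if the tag matches the assumed spelling."""
    match = NAIVE_SRC.search(tag)
    return match.group(1) if match else None
class _FirstSrc(HTMLParser):
    """Records the first src attribute seen on any start tag."""
    def __init__(self):
        super().__init__()
        self.src = None
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "src" and self.src is None:
                self.src = value
def parsed_src(tag: str) -> Optional[str]:
    """Return the src value using a tokenizer instead of a pattern."""
    parser = _FirstSrc()
    parser.feed(tag)
    parser.close()
    return parser.src
# Four spellings of the same tag: quoting, whitespace, and case all vary.
SAMPLES = ['<img src="a.png">', "<img src='a.png'>", '<img src = "a.png">', "<IMG SRC=a.png>"]
```
Only the first sample satisfies the naive pattern, while the tokenizer reads the same value from all four.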

02:41.000 --> 02:45.000
And I said, it's finally time to solve this comprehensively.

02:45.000 --> 02:55.000
And we'll just do the work. We'll build the full regular expression that handles all of this stuff to get an attribute out of an HTML tag. Simple, simple, simple.

02:55.000 --> 03:01.000
And then I realized you have to start at the beginning of an HTML document and parse your way forward if you want to be able to do that reliably.

03:01.000 --> 03:04.000
And so was born our HTML parser.

03:04.000 --> 03:17.000
Along the way, we learned an awful lot and what I want to focus on today is what we learned about DOM interfaces and more generally about parsers that are not designed, which I'll expand on later.

03:17.000 --> 03:23.000
PHP historically has had DOMDocument, which is a native implementation of an XML parser.

03:23.000 --> 03:31.000
It was used mistakenly for many years to parse HTML documents, but as of PHP 8.4 it finally has an HTML parser.

03:31.000 --> 03:36.000
My phone comes with an HTML parser. A number of languages have one built in.

03:36.000 --> 03:41.000
But the DOM interface is expensive.

03:41.000 --> 03:51.000
When we're talking about speed, it's actually faster at parsing HTML than our user-space PHP parser is, which is no surprise, because it's running native code.

03:51.000 --> 03:57.000
But as we'll see later, in practice our code ends up running faster.

03:58.000 --> 04:01.000
But speed is one thing, because most documents are small.

04:01.000 --> 04:08.000
At WordPress.com, for example, we handle hundreds of millions of requests all the time.

04:08.000 --> 04:17.000
And most, you know, 98% of those requests or 98% of what we do with HTML are small snippets and they go extremely fast.

04:17.000 --> 04:23.000
So something that's a little faster or a little slower is a little better or a little worse.

04:23.000 --> 04:37.000
But if something bloats memory and causes the request to crash, then we go from a system that's working fully to a system that doesn't work at all and essentially has an infinite response time in a moment.

04:37.000 --> 04:44.000
DOM interfaces that build a tree from an HTML or an XML input are extremely wasteful.

04:44.000 --> 04:50.000
And this is where their costs all come from, because they start with the document in text form.

04:50.000 --> 04:59.000
Then they parse completely through the entire document. They perform a bunch of computation to convert text, to read attributes, to do error checking and error cleanup.

04:59.000 --> 05:07.000
And they do all that before you're even allowed to see the first token as a result of that parse.

05:07.000 --> 05:13.000
So the HTML API in WordPress is what we call a set of classes that are used to parse HTML.

05:13.000 --> 05:22.000
And we designed it to be fast enough to be efficient for all requests and to be as close as possible to zero allocating as we could.

05:22.000 --> 05:27.000
So we wanted to build a streaming, re-entrant parser that we can pause and resume.

05:27.000 --> 05:35.000
The DOM interfaces though, even if they were perfect in this regard, are inconvenient interfaces for doing what a lot of people want to do.

05:35.000 --> 05:39.000
And maybe it's wrong to pick on PHP because of its legacy here.

05:39.000 --> 05:47.000
But, you know, if we write standard code that we want to do this work on, we run it and then it says, you know, oh, there's an error because there was no HTML tag.

05:47.000 --> 05:53.000
There was no doctype declaration. So we say, okay, I remember I need to put in these libxml flags.

05:53.000 --> 06:00.000
And then it throws another warning. It says LIBXML_NOWARNING was for DOMDocument. You can't use that for HTML documents.

06:00.000 --> 06:02.000
We take that out and then we run the code again.

06:02.000 --> 06:11.000
And then when we see what it spits out, it's added a doctype, it's added an HTML tag, it's added body, so we go back and we add the no-implied flag.

06:11.000 --> 06:22.000
What this highlights is that there's an awful lot of internal nuance that any developer needs to know when using these official parsers before they can even do the most primitive basic change to a document.

06:22.000 --> 06:30.000
WordPress's HTML API code looks structurally similar, but it's considerably different inside.

06:30.000 --> 06:36.000
For one, we don't have configuration flags because a browser doesn't have configuration flags.

06:36.000 --> 06:48.000
We've done a lot of work to try and use the naming of the methods on this parser to simultaneously educate developers about HTML or in other words the language they're working with.

06:48.000 --> 06:54.000
And we've put a lot of effort to try and make it really, really difficult to mess up.

06:54.000 --> 07:06.000
From the developer's perspective, the code doesn't feel a whole lot different in this particular use case, but the output is also significantly different.

07:06.000 --> 07:22.000
If we take a look at the malformed input undergoing the same transformation from the previous two slides, we can see that the output in the middle, from a tree builder or a DOM interface, normalized all the attributes, double-quoted them, and removed duplicate attributes.

07:22.000 --> 07:32.000
But surprisingly, WordPress doesn't do that. WordPress even left the duplicate src attribute in here, because it knows that duplicate attributes on HTML tags are ignored.

07:32.000 --> 07:38.000
We realized that browsers are all performing the same error cleanup that the server was doing.

07:38.000 --> 07:48.000
And so, if we have 100,000 requests, there's no point in us making the same transformation 100,000 times on the server and shipping it out to browsers that are then going to run the exact same code.

07:48.000 --> 07:59.000
So in addition to thinking it's silly for us to spend all this time on precious limited server resources, it was also very important for us to minimize the diffs.

07:59.000 --> 08:08.000
What is the minimum change? What is the smallest change that we can impart on a given document in order to accomplish the goals that the developer wanted to request?

08:08.000 --> 08:14.000
So there's all sorts of cleanups we don't do because they're not necessary.
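
NOTE
Editor's sketch, not from the talk: one way to picture a "minimum change" is to splice replacements into the original text at known offsets instead of re-serializing a tree. The helper apply_edits and the Edit tuple are hypothetical, and the offsets are assumed to come from some tokenizer that tracked them.
```python
from typing import Iterable, Tuple
Edit = Tuple[int, int, str]  # (start offset, end offset, replacement text)
def apply_edits(source: str, edits: Iterable[Edit]) -> str:
    """Apply non-overlapping edits; every character outside them is kept verbatim."""
    result = source
    # Work from the end of the document so earlier offsets stay valid.
    for start, end, replacement in sorted(edits, reverse=True):
        result = result[:start] + replacement + result[end:]
    return result
```
Replacing only the span of one attribute value leaves the odd spacing, the quoting style, and even a duplicate attribute elsewhere in the tag exactly as they were written.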

08:14.000 --> 08:22.000
And that's not to say that we should never normalize our HTML because there's a lot of systems where we kind of want to prevent downstream parsers from having trouble.

08:22.000 --> 08:25.000
We can do that. We have a normalized function.

08:25.000 --> 08:35.000
But we're not going to do that unless you ask for it. We like to make the cost of what we're doing obvious and clear to the developers.

08:36.000 --> 08:39.000
It may not be obvious, but DOM parsers are not safe.

08:39.000 --> 08:44.000
And this is true even in the browser's JavaScript DOM.

08:44.000 --> 08:47.000
We have to talk about safety because there's different levels, right?

08:47.000 --> 08:53.000
Like the most important level of safety is I shouldn't have the ability to convince you to log me into your bank account.

08:53.000 --> 09:01.000
And thankfully a DOM interface is not going to allow us to do that. Like an HTML spec compliant parser is not going to let us do that.

09:01.000 --> 09:10.000
But actually it does. Because it's able to create these nodes in memory in the tree form, which cannot be represented by HTML.

09:10.000 --> 09:16.000
It's kind of silly, but not everything that runs in a browser can load in a browser.

09:16.000 --> 09:20.000
It can only happen when JavaScript code runs and creates that DOM.

09:20.000 --> 09:26.000
Now on the server, we can actually do this. We can create the same DOM in memory.

09:26.000 --> 09:34.000
The problem is we cannot get that DOM in memory out in a way that the browser will recreate it.

09:34.000 --> 09:40.000
This is not in a security context, but I still consider it safety-related: it's about corrupting content.

09:40.000 --> 09:47.000
We've asked the DOM API to add an H2 element as the child of an H1 element.

09:47.000 --> 09:51.000
If you're keen on HTML, you know this is not allowed.

09:51.000 --> 09:58.000
But the DOM API allows us to do it anyway. And if we take a look at what it spits out, it literally did what we asked it.

09:58.000 --> 10:04.000
It put an H2 element inside of the H1. The problem is that this doesn't round trip.

10:04.000 --> 10:11.000
If you then feed what it spits out directly back into itself, it produces a different tree in memory.

10:11.000 --> 10:18.000
This is a divergence between what the developer thought they were doing and what actually happens when that HTML loads in a browser.

10:19.000 --> 10:31.000
In WordPress, we don't allow this. We don't allow anything to happen where what a browser interprets would be different than what it looks evidently like you're doing from the code.

10:31.000 --> 10:39.000
And that brings us to one really big misunderstanding on the web, highlighted in this famous Stack Overflow post.

10:39.000 --> 10:44.000
Who's seen this? Anyone? Don't use regular expressions to parse HTML.

10:44.000 --> 10:49.000
And then the quote ends with, have you tried using an XML parser instead?

10:49.000 --> 10:56.000
And this is so funny to me because this is almost as bad advice as telling someone to parse HTML with a regex.

10:56.000 --> 11:05.000
It's true that some HTML documents can be parsed by XML, just as some HTML documents can be parsed by regular expressions.

11:05.000 --> 11:13.000
But they're different languages, and the differences go so far beyond self-closing flags, double quotes, and character references.

11:13.000 --> 11:22.000
Even JavaScript has a different operating model in XHTML contexts, which of course almost don't exist on the web.

11:22.000 --> 11:32.000
There are probably a countable number of XHTML pages on the web because when a browser reads a document that claims to be XHTML, it actually uses an HTML parser.

11:32.000 --> 11:42.000
Unless there's a specific HTTP header indicating that its content type is XML.

11:42.000 --> 11:57.000
DOM interfaces are atomic. I mentioned a moment ago that when one reads in its HTML, it performs 100% of the work to produce a pristine tree in memory before it lets you even start.

11:57.000 --> 12:17.000
With our parser, which I think you might call a pull parser, we scan through each token in the document and we return control to you when you ask to step through each token, or when you ask to go to a specific tag, or when you tell it to seek until you find a tag with a given class name.

12:17.000 --> 12:27.000
Which means you can do really interesting things like give it a budget or give it a word count in which case it will progress through the document, understanding the structure of the document up to that point.

12:27.000 --> 12:32.000
And when you stop, it's done parsing. It's as if the rest of the document doesn't exist.
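
NOTE
Editor's sketch, not from the talk: a rough Python analogue of stopping after a word budget, assuming the standard-library html.parser. That class is a push parser, so the sketch abandons the parse by raising an exception from a callback, which is only an approximation of a true pull parser. The names first_words, _WordCollector, and _BudgetReached are hypothetical.
```python
from html.parser import HTMLParser
class _BudgetReached(Exception):
    """Raised to abandon the parse once enough words were seen."""
class _WordCollector(HTMLParser):
    def __init__(self, budget):
        super().__init__()
        self.budget = budget
        self.words = []
        self.tags_seen = 0
    def handle_starttag(self, tag, attrs):
        self.tags_seen += 1
    def handle_data(self, data):
        self.words.extend(data.split())
        if len(self.words) >= self.budget:
            del self.words[self.budget:]
            raise _BudgetReached
def first_words(html_text, budget):
    """Collect up to `budget` words and count the start tags visited on the way."""
    collector = _WordCollector(budget)
    try:
        collector.feed(html_text)
        collector.close()
    except _BudgetReached:
        pass  # Everything after this point was never tokenized.
    return collector.words, collector.tags_seen
```
With a budget of three words, a document with three paragraphs of two words each is abandoned inside the second paragraph, so the third start tag is never visited.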

12:32.000 --> 12:45.000
So even though our parser may be two to six times slower than a native parser running C code under the hood (which wouldn't even apply if you were recreating the interface in another language),

12:45.000 --> 12:54.000
the fact that we end up parsing less than one percent of the document means we still run faster. And by the way, we've used zero bytes of memory.

12:54.000 --> 13:07.000
Well, we've probably used a couple hundred bytes of memory, while we routinely see WordPress.com requests that crash because the DOM parser uses hundreds of megabytes of memory.

13:07.000 --> 13:13.000
Because it's re-entrant and streaming, we don't even need to have the full document before we start parsing.

13:13.000 --> 13:23.000
In fact, as many of you probably know WordPress is quite extensible. And there's times where we're marching through a document and we get to something that will then generate HTML.

13:23.000 --> 13:31.000
We're able to kind of like step into that part of the document, generate the HTML, append it to the parser and then progress until we run out.

13:31.000 --> 13:45.000
And it'll even tell us if it stopped in the middle of a token. So whereas a browser does certain things to terminate there, we can say, oh, we started parsing a tag, but we didn't complete it yet and the document ends there.

13:45.000 --> 13:55.000
So we're just going to wait and you get to choose what to do. You can pop in the next chunk or you can treat it as the end.
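
NOTE
Editor's sketch, not from the talk: Python's standard-library html.parser shows a similar "wait for the rest of the token" behavior when input arrives in chunks. The helper feed_in_chunks is hypothetical, and this parser does not expose an explicit "incomplete token" status the way the talk describes.
```python
from html.parser import HTMLParser
class _StartTags(HTMLParser):
    """Records every start tag as (name, attributes)."""
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append((tag, attrs))
def feed_in_chunks(chunks):
    """Return the start-tag events visible after each chunk is fed."""
    parser = _StartTags()
    snapshots = []
    for chunk in chunks:
        parser.feed(chunk)  # An unfinished tag stays buffered until more input arrives.
        snapshots.append(list(parser.events))
    return snapshots
```
Feeding a tag that is cut in the middle of an attribute name produces no event for the first chunk; the start tag is reported only once the second chunk completes it.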

13:55.000 --> 14:03.000
A lot of tooling ends up assuming that HTML is just tags and text. But as we all know, there's character references.

14:03.000 --> 14:11.000
And I just want to highlight a few tools that either exist in WordPress or I've built on my own.

14:11.000 --> 14:19.000
They highlight some of the needs we have, or at least that I see a lot of developers have, when working with code, that a tree doesn't solve.

14:19.000 --> 14:27.000
We want to go and we want to check all the src or all the href attributes. And we want to see if they start with "javascript:".

14:27.000 --> 14:40.000
But then we forget that they can be encoded in character references. And maybe some code does look for, you know, the J ampersand hash X60 something or 70 something.

14:40.000 --> 14:50.000
But they forget that it can be a capital A or a lowercase a. They forget that you can have an arbitrary number of leading zeros before the digits.

14:50.000 --> 15:00.000
Or they forget that it can be decimal instead of hexadecimal. So we provide a zero-allocation method where you can say, hey, look, does this start with this raw text?

15:00.000 --> 15:06.000
And as a developer, you don't have to think about it.
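
NOTE
Editor's sketch, not from the talk: a minimal Python version of the check, assuming html.unescape from the standard library. Unlike the method described in the talk, it allocates a decoded copy, and it decodes named references with text rules rather than attribute rules. The function name decoded_starts_with is hypothetical.
```python
import html
def decoded_starts_with(raw_attribute_value: str, prefix: str) -> bool:
    """Decode character references first, then compare case-insensitively."""
    decoded = html.unescape(raw_attribute_value)
    # A real URL check would also strip leading whitespace and control characters.
    return decoded.lower().startswith(prefix.lower())
```
The same comparison then covers hexadecimal references in either case, leading zeros, decimal references, and a missing semicolon, none of which a search for the literal text would catch.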

15:06.000 --> 15:15.000
I also built a very basic tool to search for plain text content within an HTML document and give back the indices where that match was found.

15:15.000 --> 15:21.000
You can see here that the first letter of the match was itself a character reference and it spans this EM tag.

15:21.000 --> 15:28.000
But we can go step further and tell it to extract the search result in which case something funny happens.

15:28.000 --> 15:42.000
We can see that the full input isn't adequately represented in the output because we drag the context where that match was found and we provided an isolated contained chunk or fragment of the document.

15:42.000 --> 15:47.000
This is central to the problem with XML tooling out there on the web.

15:47.000 --> 15:55.000
This is central to problems with HTML tooling that when we pull a fragment of the document out, we lose relevant context.

15:55.000 --> 16:02.000
It's even worse with XML, where there might be entity definitions in the preamble of the document and there might be namespaces.

16:02.000 --> 16:06.000
There might be namespace changes along the way.

16:06.000 --> 16:15.000
And so what we've done is we said, look, if we're going to pull something out of here, we need to adequately represent the way this appeared in its source document.

16:15.000 --> 16:24.000
If I want to quote a blog post online, I want to get that content out, but I don't want the whole page, but I also don't want to like lose the, you know, in this case,

16:24.000 --> 16:31.000
the fact that this was emphasized. So the HTML API has provided that tag for us.

16:31.000 --> 16:36.000
Also, it works with regexes.

16:36.000 --> 16:52.000
Now, not every HTML document can be represented in XML. The XML serialization of HTML, also known as XHTML, can actually generate those invalid DOM trees that HTML cannot.

16:52.000 --> 16:59.000
But the HTML also cannot comprehensively be represented in XHTML. The same way it cannot be represented in the DOM.

16:59.000 --> 17:12.000
But WordPress is able to translate HTML into XML or reinterpret it in ways that are so cool if you're into this thing, which you may not be, and that's fine.

17:12.000 --> 17:18.000
But you can see there's a lot going on under the hood a lot more than slashes and quotes.

17:18.000 --> 17:31.000
We understand what happens when namespaces change. If you look in this input HTML, we know that the P tag implicitly closes the SVG and kind of brings us back into the HTML namespace.

17:31.000 --> 17:44.000
And we're able to adequately represent that in XML. And this stuff will round-trip: it won't reproduce the input HTML, but it'll produce equivalent HTML.

17:44.000 --> 17:56.000
My two favorites, the two most common use cases I see where HTML gets corrupted are truncation and conversion to plain text formats like Markdown.

17:56.000 --> 18:01.000
Because of the way that this parser steps through the document, we can tell it

18:01.000 --> 18:11.000
to stop once it has essentially seen three words, and it'll just do it. It'll give us that same isolated contained fragment of HTML.

18:11.000 --> 18:18.000
It's recognized, okay, we hit the end with this EM tag, but there's tags that are still open.

18:18.000 --> 18:30.000
So it's not only going to provide implicit openers, it's going to add implicit closers, and just by changing a single parameter and inside of the function, it's like a simple if statement, we can just grab this plain text content.
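
NOTE
Editor's sketch, not from the talk: a simplified Python take on word-count truncation that adds closers for tags still open, assuming the standard-library html.parser. It does not add implicit openers, it collapses whitespace in the final text node, and it ignores most HTML tree-construction rules. The names truncate_words, _Truncator, and VOID_ELEMENTS are hypothetical.
```python
import html
from html.parser import HTMLParser
# Elements that never take an end tag, so they are never pushed on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "source", "track", "wbr"}
class _Done(Exception):
    """Raised once the word budget is spent."""
class _Truncator(HTMLParser):
    def __init__(self, budget):
        super().__init__()
        self.budget = budget
        self.count = 0
        self.out = []
        self.open_tags = []
    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # Keep the tag as written.
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)
    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")
    def handle_data(self, data):
        words = data.split()
        remaining = self.budget - self.count
        if len(words) < remaining:
            self.count += len(words)
            self.out.append(html.escape(data, quote=False))
            return
        # Budget reached: keep the allowed words, then close whatever is open.
        self.out.append(html.escape(" ".join(words[:remaining]), quote=False))
        self.out.extend(f"</{tag}>" for tag in reversed(self.open_tags))
        raise _Done
def truncate_words(html_text, budget):
    """Return a prefix of the markup holding at most `budget` words, with open tags closed."""
    parser = _Truncator(budget)
    try:
        parser.feed(html_text)
        parser.close()
    except _Done:
        pass
    return "".join(parser.out)
```
Cutting a paragraph in the middle of an emphasized phrase keeps the emphasis on the words that survive and closes both the EM and the P that were still open.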

18:30.000 --> 18:39.000
What I've probably done a terrible job communicating is parsers have to follow design.

18:39.000 --> 18:45.000
If we want to be successful when we're designing parsers, then we have to start by asking, what are they for?

18:45.000 --> 18:54.000
What is the goal of being able to understand HTML or XML? What is our goal in reading an RSS feed?

18:54.000 --> 19:07.000
The DOM is extremely important, particularly in the browser, because the browser turns that into UI elements, and it'll even do weird UI stuff like put a button inside of a button.

19:07.000 --> 19:18.000
But when we're on the server, and we have hundreds of thousands of requests and latency matters and memory use matters, and all we want to do is find an interesting spot in the document,

19:18.000 --> 19:25.000
and make a change, and continue, then we want to be able to design our interfaces around those goals.

19:25.000 --> 19:46.000
And I think it's really sad, because I've come to love XML. I did not always love XML, but I've come to love it, and I think it's a tragedy that we lacked the adequate tooling that was convenient enough, reliable enough, and safe enough, for people to reach for that instead of looking to use these substandard tools.

19:46.000 --> 19:57.000
I think a lot of times we get this idea that there's some kind of inherent way to turn a string into a particular thing, especially if there's a specification for that.

19:57.000 --> 20:03.000
But the specifications can be a guide for how to properly interpret content.

20:03.000 --> 20:20.000
We took a look at the needs. We had the benefit of having 20 years of HTML handling code in WordPress to look back on and say, where have these interfaces, where have these parsers failed us, and use that to design the new system.

20:20.000 --> 20:31.000
But if you never stop and ask, if we just say, okay, I'm going to set out today and write a JSON parser, then we're missing a huge opportunity to meet real needs that aren't being served.

20:31.000 --> 20:42.000
A quick note, WordPress is pushing upstream with this. We just closed out a bug in Lexbor. Lexbor is the open source HTML parser that actually went into PHP 8.4.

20:42.000 --> 20:57.000
We noticed it was misparsing script tags, where our code isn't. And they fixed it. We've also pushed bug fixes to the HTML spec itself, where it's either ambiguous or conflicting in what it says.

20:57.000 --> 21:09.000
And just to clear my conscience, we do have a Rust port of some of this so that we can build it to WebAssembly and get it running in the browser, where it's faster than Chrome's HTML parser.

21:09.000 --> 21:16.000
Again, because it skips doing any work it doesn't have to do.

21:16.000 --> 21:25.000
I could go on for hours, but I want to leave some time for questions. I thank you so much for attending and putting up with this fast pace.

21:25.000 --> 21:37.000
Thank you very much. We have five minutes for questions.

21:37.000 --> 22:03.000
Okay, in this case, does anybody want to guess what the attribute names are in this slide?

22:03.000 --> 22:29.000
The first attribute name is the Kalamoji. And you guess what the second attribute name is?

22:29.000 --> 22:41.000
The second attribute name is equal sign backtick and no backtick. Anybody want to take a guess what the third attribute name is?

22:41.000 --> 22:55.000
It's not the equal sign. It's the ampersand. How about the fourth attribute name? I'm going to come back to you. What's the fourth attribute name?

22:55.000 --> 23:05.000
So close. It's actually the equal sign. Question.

23:05.000 --> 23:19.000
That's how a spec-compliant HTML parser will read this tag. The tag's name is i, the letter, and then the isomoji.

23:19.000 --> 23:29.000
We get to these places because we look at HTML inputs we happen to have seen, and we try to form a mental model around that.

23:29.000 --> 23:42.000
That actually works reasonably well with XML, but that's not how HTML works. HTML is this dual stack machine that when you learn about how it works makes a lot of sense.

23:42.000 --> 23:50.000
And it clarifies all the problems with missing tags or optional tags.

23:50.000 --> 23:57.000
But if we want to avoid the characteristic problems in this domain, we just want to start with a spec.

23:57.000 --> 24:05.000
And that'll turn us from writing code and then spending a decade handling edge cases into writing code.

24:05.000 --> 24:12.000
And figuring out how to anticipate the kind of inputs which would trigger bugs in other code.

24:12.000 --> 24:19.000
Because we've started with the mental model of how the system works and implemented its rules.

24:19.000 --> 24:30.000
Instead of starting with, well, I wrote an HTML tag once.

24:30.000 --> 24:42.000
Thank you very much for the talk, right? It's like they said, keep your dragons and you will, yes, drag these.

24:42.000 --> 24:49.000
Thank you. There are dragons, but they're fun dragons.

24:49.000 --> 24:54.000
Oh, they are pretty nice.

24:54.000 --> 24:57.000
Okay, thanks for that, Dennis.

24:57.000 --> 25:03.000
Thank you everyone.