How Browsers Really Parse HTML (and What That Means for SEO)

2026-02-26 · en-j3PyPqV-e1s manual
[MUSIC PLAYING]
MARTIN: Well, hello, hello, hello, everybody.
This is Martin from the Search Relations Team.
And welcome to a new episode for "Search Off the Record."
With me today is Gary.
Hello, Gary.
GARY: I don't want to talk about it.
MARTIN: OK, but I do want to talk about something, because I
had a thought.
GARY: Oh, no.
Not again.
Martin.
MARTIN: I know.
GARY: We asked you this so many times.
MARTIN: I mean, they are few and far in between,
but I thought I'd give it a go again.
It's a new year.
It's a new me.
It's a new thought.
[CHUCKLES]
I promise, it's the only one for this quarter.
Is that OK?
[CHUCKLES]
GARY: So what were you thinking?
MARTIN: I don't think we've ever discussed
how HTML parsing works, which I think
is important to understand.
And I see that, especially when I
look at people who have been working
with the web for a long time, not all of them
are paying that much attention.
And I realized that I'm not paying enough attention
because there's a new kid on the block that I have heard about,
but not really looked into.
And that's client hints.
But I think before we can discuss client hints,
we should talk about how that generally works.
And I think you're the right person because you and I
discussed parsing beforehand.
So shall we talk about that?
GARY: OK.
MARTIN: OK.
GARY: Well, can I say no?
MARTIN: You can say no, but I'm going to talk to you about it
anyway, as if that made a difference.
I don't know when you will learn that, but--
GARY: Interesting.
MARTIN: I'm an excited puppy.
I'm going to bark up to you anyway.
[IMITATES PANTING]
GARY: OK.
MARTIN: HTML.
GARY: That's such an exciting topic.
So stepping back, why are you bringing this up now?
MARTIN: I know that the way you build your websites
has an impact on how it performs in terms
of perceptible speed for the user,
as well as how it performs when crawlers
have to interact with it.
And there are things coming, and have
come, in the last couple of years that I honestly slept on.
So I think now is as good as any time
really to catch up people out there, as well as me, on this
a bit.
GARY: So basically nothing happened.
Did you just want to talk about it because we
haven't talked about this?
MARTIN: Yeah, pretty much.
GARY: All right.
I was asking, because I find that when we are grounding
these discussions into an issue that we found,
it's more interesting to me to talk about
because you are explaining an issue versus trying
to describe a system.
But of course, this should be fine as well.
MARTIN: Do you have an issue in parsing that I'm not aware of?
GARY: Oh, so many.
[LAUGHTER]
I mean, parsing HTML is notoriously--
MARTIN: Challenging?
GARY: What's the PC term?
Yeah, let's go with challenging.
You would think as a fellow developer who probably started
in the early 2000s, or even in the 1990s,
that you can just write.
Or at least when you were newbie,
you might have thought that, hey,
I can just write this nice regex, or "rejex"
as John Mueller would say, and that will work.
MARTIN: I did that.
GARY: Right?
MARTIN: I did that.
It did not work.
GARY: Yeah, me too.
I think everyone who ever tried to develop something
for the internet at one point in their life will have written
a piece of regex slash "rejex" that probably worked for some
cases, but not for all cases.
MARTIN: Yep.
GARY: And that is because technically, HTML
is supposed to be this beautiful structured thing.
But in reality, it's just a mess because, well, it
has to work all the time in browsers, which
means that browsers are extremely lenient about what
they accept, which in turn turns the developers extremely
lenient.
And then they spit out random stuff
in their notepad.exe [CHUCKLES] which will work in browsers,
will work for the users.
But it's going to be a nightmare to parse.
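Editor's sketch (not from the episode): a minimal Python illustration of why a first-attempt regex tends to fail on markup that browsers accept anyway. The sample markup, the regex, and the helper names are all made up for illustration. The comparison uses the standard library's html.parser, which is more tolerant than a regex but is still not a full browser-grade HTML parser.

```python
import re
from html.parser import HTMLParser

# Messy but browser-tolerated markup: an uppercase tag with an unquoted
# attribute, a ">" inside a quoted attribute value, and a link that only
# exists inside a comment.
MESSY_HTML = """
<A HREF=plain.html>unquoted</A>
<a title="a > b" href="/tricky">quoted gt</a>
<!-- <a href="/commented-out">not a real link</a> -->
"""

# A typical first attempt: lowercase tag, double-quoted href, no ">" before it.
NAIVE_LINK_RE = re.compile(r'<a[^>]*href="([^"]*)"')


def links_via_regex(html: str) -> list[str]:
    """Return href values found by the naive regex."""
    return NAIVE_LINK_RE.findall(html)


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags only (comments are skipped)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def links_via_parser(html: str) -> list[str]:
    """Return href values found by the tolerant tokenizer."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    print("regex :", links_via_regex(MESSY_HTML))
    print("parser:", links_via_parser(MESSY_HTML))
```

On this sample the regex misses both real links and picks up the commented-out one, while the parser reports the two real links.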
MARTIN: True.
GARY: Right?
MARTIN: Yes, and I found the standard also quite lenient.
GARY: Yeah.
MARTIN: It allows a lot of stuff.
It's interesting.
Yeah.
GARY: Yeah, we should probably link to the standard
in the episode notes.
But basically, it's a living standard.
It keeps changing, depending on what the web needs.
MARTIN: Yeah, I would say that's true.
And it postulates how browsers and user agents in general
should deal with what's out there on the web.
And they are trying to minimize breaking what is already
on the web while keeping it flexible enough for new stuff,
which I think is pretty cool.
It's a pretty impressive effort, I would say.
GARY: Yeah, I mean, it's been alive for 30 years or so.
MARTIN: Wild.
GARY: Even the age is a testament to how cool it is.
MARTIN: I also remember that when
I started building websites, I was absolutely and madly in love
with the validator.
There's a thing that tells if your HTML is valid or not.
GARY: Oh yeah.
MARTIN: And then I was very depressed
when I found out it doesn't matter as much.
GARY: What's this called, W3C--
MARTIN: Validator.
GARY: Validator.
Yeah, I also used to obsess about that as a younger newbie
developer.
Is it "new-bye" or "new-bee?"
MARTIN: "New-bee" I believe.
GARY: OK, newbie.
Oh yeah, the two native English speakers on the team.
[CHUCKLES]
Anyway, I was obsessing about it quite a bit.
And then eventually, I noticed that it really doesn't matter.
It doesn't matter for the browsers.
It doesn't matter for search engines.
Unless you do something utterly stupid with your HTML,
it's just going to work.
I think that this has evolved to this stage where we are now,
because in the earlier days when you still had Netscape,
for example--
Netscape is a very, very, very old browser for listeners.
Back in those days, you did have to do really hacky stuff,
because you had Netscape, you had Internet Explorer,
what else?
MARTIN: Safari at some point.
GARY: At some point, yes-- the early Firefox.
And they were lenient in different ways.
And then you had to do some hacks,
like including special CSS for just Internet
Explorer or Netscape.
MARTIN: I remember that.
Yeah, like the star hack.
Yeah, this only works in certain browsers,
so you can use it to address those quirks and those ones.
Yeah.
Cross-browser compatibility was a huge issue.
GARY: Yeah, and back then, looking
at the W3C validator actually mattered, because the more valid
your HTML was, in theory at least,
the better it worked across different browsers.
But nowadays, I think that matters very little.
MARTIN: OK.
GARY: I don't know if you agree with that or not.
MARTIN: I do agree with that.
But I know that there is nuance here,
which is, there are ways to break
things that then break expectations
in ways where then, yeah--
GARY: There's always ways to break things.
MARTIN: People were like, oh, so you have to be, like,
100% compliant to the spec.
And then other people were like, eh, it doesn't matter.
And then they build something that doesn't work.
And they're like, whoa, why does this not work?
GARY: Yeah.
MARTIN: So it's not as easy as saying, it doesn't matter
or it matters a lot, or it matters
a little, because it depends on what you're looking at, right?
GARY: Yeah.
MARTIN: I give you an example because I think we have
a video-- and I'm going to link the video in the description
of the podcast as well, where I discuss with--
I believe it was Bastian Grimm, a case where
they had hreflang link tags in the head where they belong.
But before them, there was a script
which can also appear in the head that is legitimate, specs
compliant.
But then the script injected an iframe right after itself,
and that kind of closed the head.
And then the links moved into the body,
and that's where our infrastructure ignored them--
correctly, so I would argue.
GARY: Right.
Oh, I would strongly argue for that as well.
So if you go back to the living standard,
and then you look at what can appear where,
you focus on meta tags in this case, or link tags.
And it uses some kind of flowery language about where
those tags or elements, or whatever you want to call them,
can appear.
Basically, for example, for meta tags
it says that they can only appear--
and this is from memory.
I might use the wrong words.
Meta tags can only appear in sections
where metadata is defined or in the context
of metadata definitions.
And that is a very broad thing to say.
But then if you start looking at the spec again
and start looking at where are you allowed to define metadata,
it's just a head.
I couldn't find any other place where you
are allowed to define metadata.
MARTIN: Oh man.
I think I looked at it, and I think
there is one specific case where you can do it in the body.
But it's a really limited edge case.
Maybe I'm hallucinating.
GARY: No, no, no, you are not.
I have to specify-- a meta tag with a name attribute can only
appear in the context of other metadata.
And other metadata can only appear in the head.
So, for example, if you take the meta name robots element,
because that's a named meta element,
according to the standard, that can only appear in the head.
MARTIN: OK, yeah, that's very possible.
And I think charset can also only appear in the head.
GARY: Yes.
And I haven't looked at link tags
because I didn't have a reason to.
I mean, recently at least.
But I would assume that those can only appear in the head
also.
MARTIN: I think that's the case, yes.
GARY: I mean, we can look it up.
MARTIN: And if they appear in the body,
they are discarded or something.
GARY: They being standard.
Now, we are using our favorite search engine.
I'm not going to say what it is because reasons.
The link element.
MARTIN: The link element.
Yeah, where metadata--
GARY: It is metadata content.
And the context in which this element can be used
is where metadata content is expected, which takes us back
to the head.
MARTIN: True.
GARY: But then weirdly enough, the standard also
says that in a noscript element that is a child of a head element,
it can appear.
Sure, that makes perfect sense.
And then finally, it says that if the element
is allowed in the body where phrasing content is expected--
I don't know what that means.
So we click that.
[CHUCKLES]
So phrasing content is the text of the document,
as well as elements that mark up that text at the intra-paragraph
level.
Well this is not helpful.
MARTIN: Oh, but it's stuff like most of it.
GARY: Yeah.
MARTIN: I wonder--
GARY: OK, I figured it out.
I'm a genius.
It's the same.
[CHUCKLES]
Also, I'm very humble, if you haven't noticed.
MARTIN: Noticed, yeah.
GARY: So it's the same as with the meta name.
If a link element has an itemprop attribute,
or has a rel attribute that contains only keywords that
are body-ok-- whatever that means,
then the element is said to be allowed in the body.
MARTIN: OK, yeah, that makes sense.
GARY: So I'm assuming that you can use it for RDF stuff?
I don't know.
MARTIN: Anyway, generally you would expect them in the head,
I would argue-- links.
GARY: Right, in body, you can also find it.
But only if it's used for very specific purposes.
So for example, again going back to the '90s-- well,
not '90s, early 2000s or mid 2000s, you remember pingback?
MARTIN: Yes.
GARY: Like on blogs, you could find those pingback thingies.
And pingback is OK.
If it's a rel pingback, or link rel pingback,
that is OK in the body for some reason.
I don't know why.
[CHUCKLES]
Prefetch, preload is also OK.
Style sheet is OK, because it's not technically metadata.
It's a thing that will change how the page looks
in case of style sheet.
With preload, you are just instructing the browser
to do some magic in the background
to load the next thing faster.
But as you said, in general you would expect link elements,
or at least those that carry some form of metadata,
to be in the head.
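Editor's sketch (not from the episode): the rule Gary reads out can be written as a small Python predicate. A link element counts as allowed in the body if it has an itemprop attribute, or if its rel attribute contains only keywords the standard marks as body-ok. The keyword set below is an assumption written from memory of the HTML living standard's link-type table and should be checked against the current spec. The function name and the handling of an empty rel are illustrative choices.

```python
# Keywords believed to be marked "body-ok" in the HTML living standard.
# ASSUMPTION: written from memory; verify against the current spec table.
BODY_OK_KEYWORDS = frozenset({
    "dns-prefetch",
    "modulepreload",
    "pingback",
    "preconnect",
    "prefetch",
    "preload",
    "stylesheet",
})


def link_allowed_in_body(attrs: dict[str, str]) -> bool:
    """Return True if a <link> with these attributes may appear in the body."""
    # An itemprop attribute is enough on its own.
    if "itemprop" in attrs:
        return True
    # Otherwise every rel keyword must be body-ok. rel is a space-separated,
    # case-insensitive keyword list.
    keywords = attrs.get("rel", "").lower().split()
    # Illustrative choice: a missing or empty rel is treated as not allowed.
    return bool(keywords) and all(k in BODY_OK_KEYWORDS for k in keywords)


if __name__ == "__main__":
    for attrs in (
        {"rel": "pingback", "href": "https://example.com/xmlrpc"},
        {"rel": "canonical", "href": "https://example.com/"},
        {"rel": "alternate stylesheet", "href": "/alt.css"},
        {"itemprop": "url", "href": "https://example.com/item"},
    ):
        print(attrs, "->", link_allowed_in_body(attrs))
```

Under that assumed keyword set, rel="canonical" and rel="alternate" (the one used with hreflang) are not body-ok, which matches the point that links carrying metadata for search engines are expected in the head.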
MARTIN: Yeah.
GARY: And I would argue that it's really quite dangerous
to have link elements that carry metadata in the body.
MARTIN: Yeah, I see that.
I see your point.
And I think I follow that.
Yes.
GARY: OK.
MARTIN: I'm still mulling over this body-ok bit.
So if it has an itemprop, then it
can potentially live in the body.
GARY: Right.
MARTIN: OK.
It has to have one of two properties.
But I'm not sure which of them needs to be there.
It has itemprop, or probably href, I guess.
GARY: I don't know.
I mean, we can look it up.
[CHUCKLES]
MARTIN: Anyway, but why does the browser, for instance,
close the head when it sees something
that shouldn't be there?
GARY: Well, exactly, because that
assumes that the page finished loading the things that
should be in the head.
So for example, if you put a paragraph, like a p element
in the head, that's basically content.
Metadata is not shown on the page.
MARTIN: True, true.
So the metadata is in the head.
And then whenever we see something that isn't metadata,
then the browser has to assume that the intention was
that this is shown to the user.
And that would mean that the body has started, and it,
quote unquote, has "missed" that the body has started.
So it starts the body for us automatically.
OK, got it.
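Editor's sketch (not from the episode): a heavily simplified Python model of the behavior Martin describes. While in the head, any start tag that is not a metadata-style element ends the head, and everything after it is treated as body content. This is not the real HTML tree-construction algorithm and not how any search engine is implemented. It ignores text nodes, nested templates, and many other rules. The class name, the sample markup, and the URLs are made up. The sample stands for the page after a script has already injected an iframe next to itself, as in the hreflang case mentioned earlier.

```python
from html.parser import HTMLParser

# Elements this toy model lets stay in the head (simplified from the
# standard's metadata content, plus script/noscript/template).
HEAD_ELEMENTS = frozenset({
    "base", "link", "meta", "noscript", "script", "style", "template", "title",
})


class HeadTracker(HTMLParser):
    """Record whether each <link> is seen before or after the head ends."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.head_closed_by = None  # tag name, or "</head>" for an explicit close
        self.links = []             # (section, rel, href) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            return
        if tag == "head" and self.head_closed_by is None:
            self.in_head = True
            return
        # Anything that is not head material implicitly ends the head.
        if self.in_head and tag not in HEAD_ELEMENTS:
            self.in_head = False
            self.head_closed_by = tag
        if tag == "link":
            a = dict(attrs)
            section = "head" if self.in_head else "body"
            self.links.append((section, a.get("rel"), a.get("href")))

    def handle_endtag(self, tag):
        # A </head> that arrives after the head was already closed is ignored.
        if tag == "head" and self.in_head:
            self.in_head = False
            self.head_closed_by = "</head>"


def track_head(html: str) -> HeadTracker:
    tracker = HeadTracker()
    tracker.feed(html)
    tracker.close()
    return tracker


WITH_IFRAME = """<html><head>
<title>Shop</title>
<script src="/widget.js"></script>
<iframe src="/tracker.html"></iframe>
<link rel="alternate" hreflang="de" href="https://example.com/de/">
</head><body><p>Hi</p></body></html>"""

WITHOUT_IFRAME = WITH_IFRAME.replace('<iframe src="/tracker.html"></iframe>\n', "")

if __name__ == "__main__":
    for label, html in (("with iframe", WITH_IFRAME), ("without", WITHOUT_IFRAME)):
        t = track_head(html)
        print(label, "| head closed by:", t.head_closed_by, "| links:", t.links)
```

In this model the same hreflang link lands in the head when the iframe is absent and in the body when the iframe precedes it, even though the author wrote it before the closing head tag.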
GARY: Yeah, I think so.
And for search engines, that's probably the same.
They try at least to behave more like browsers.
Sometimes that works, sometimes that doesn't work.
And they would accept these tags, elements in the head,
but not in the body.
And some of the things that you can specify for search engines
actually carry a lot of weight, let's say rel canonical.
MARTIN: Yeah, that's a pretty strong signal
to a search engine.
Yeah.
GARY: Yeah.
But if we allowed that in the body,
then mischievous Martin Splitt could show up on my blog
and put it in a comment.
And because I'm really bad at escaping HTML,
Martin could hijack my page, point with a rel canonical
to his blog.
And suddenly I don't have anything in search anymore.
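Editor's sketch (not from the episode): the scenario depends on user comments being written into the page unescaped. A minimal Python illustration of the escaping step, using the standard library's html.escape, is below. The render_comment function and its markup are hypothetical, and a real site would normally rely on its template engine's auto-escaping instead.

```python
from html import escape


def render_comment(author: str, text: str) -> str:
    """Render a user comment with all markup characters escaped."""
    # escape() turns <, >, & and quotes into character references, so the
    # comment is shown as text instead of being parsed as elements.
    return f'<div class="comment"><b>{escape(author)}</b>: {escape(text)}</div>'


if __name__ == "__main__":
    hostile = '<link rel="canonical" href="https://attacker.example/">'
    print(render_comment("Martin", hostile))
```

With escaping in place, the injected link element is displayed as literal text and never becomes an element. The same applies to an injected script element.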
MARTIN: OK, but wait.
Interesting point, but counterpoint.
If I have the power to inject random markup that
gets parsed as markup, I could inject a script.
GARY: Right.
MARTIN: And ask it to add the link rel canonical to the head
as well.
GARY: Yeah.
It is much simpler to just have a link tag.
Like you would want to go for the simplest.
MARTIN: Sure.
But if it doesn't work, then I can still
use JavaScript to get around that limitation.
GARY: Ah.
Sure.
[LAUGHTER]
MARTIN: See?
Mischievous Martin Splitt is mischievous.
"Mischeevous?"
"Mischivous?"
GARY: I see what you mean.
MARTIN: Yeah.
GARY: Ah, but we can get around that.
MARTIN: By not rendering.
GARY: By rendering.
MARTIN: Oh.
GARY: Oh, wait.
No, I was thinking that if the link
was introduced by rendering--
MARTIN: We can tell that because we have the original thing,
and we have the thing after rendering.
GARY: Yeah, I wanted to say something relatively stupid.
It's like, we don't accept the link rel canonical if it
was injected by rendering.
But we cannot do that.
We have to accept the link rel canonical.
MARTIN: Because there's legitimate cases where
that is done, yeah.
I think we're coming to a point where we need to realize,
as well as the people listening to us--
and this is interesting, because we actually
are thinking about this as we speak that there are decisions
to be made by whoever is consuming HTML that
are going beyond the standard.
The standard is just, you should do this
if you want to work with an HTML document.
But there are additional rules that you can, and probably have
to, put on top of the standard that are not
defined by the standard because they are application-specific.
GARY: Yeah.
MARTIN: So this is interesting, because I
know that browsers are doing a few things that are not exactly
described in the standard either.
As in when you run into a script tag,
normally, unless you use any modifying attributes
to the script tag, the browser stops doing things there,
executes the JavaScript, and then carries on.
Otherwise, if it's basically just like HTML
head with some metadata body, some text,
some images, then it kind of does like a preliminary scan
to see if there's any images, so it can in the background,
start downloading those.
And then it starts building the DOM tree and the render tree,
and making sure that you basically start seeing text
as soon as it possibly can, so it doesn't parse the whole HTML
and then show you things.
It shows you things as it goes through the HTML.
And the standard doesn't specify how that works.
That's how browsers kind of work, I believe.
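Editor's sketch (not from the episode): a small Python illustration of the two ideas in this passage. A plain script tag blocks the parser unless async, defer, or type="module" changes that, and an early scan can already spot image URLs to fetch. The classification labels, class name, and sample page are made up, and real browser preload scanners look at far more than img elements.

```python
from html.parser import HTMLParser


def script_loading_mode(attrs: dict) -> str:
    """Classify a <script> start tag by how it affects HTML parsing."""
    is_module = (attrs.get("type") or "").strip().lower() == "module"
    has_src = "src" in attrs
    # async applies to external scripts and to module scripts.
    if "async" in attrs and (has_src or is_module):
        return "async"
    # Module scripts are deferred by default; defer only affects external
    # classic scripts.
    if is_module or ("defer" in attrs and has_src):
        return "deferred"
    # Everything else pauses parsing until the script has run.
    return "parser-blocking"


class ResourceScan(HTMLParser):
    """Collect scripts with their loading mode, plus image URLs."""

    def __init__(self):
        super().__init__()
        self.scripts = []  # (src or "<inline>", mode)
        self.images = []   # image URLs a scan could start fetching early

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # boolean attributes such as async map to None
        if tag == "script":
            self.scripts.append((a.get("src") or "<inline>", script_loading_mode(a)))
        elif tag == "img" and a.get("src"):
            self.images.append(a["src"])


def scan(html: str) -> ResourceScan:
    scanner = ResourceScan()
    scanner.feed(html)
    scanner.close()
    return scanner


PAGE = """<head>
<script src="/app.js"></script>
<script async src="/analytics.js"></script>
<script defer src="/ui.js"></script>
<script type="module" src="/main.mjs"></script>
</head><body><p>Text</p><img src="/hero.jpg"></body>"""

if __name__ == "__main__":
    result = scan(PAGE)
    for src, mode in result.scripts:
        print(f"{src}: {mode}")
    print("images:", result.images)
```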
GARY: Yeah.
MARTIN: Yeah, but there are exceptions.
There are specific metadata bits and pieces
that do give us, as the website owners, and us in terms of us
as a search engine company, hints
and suggestions as to what to prioritize and how to do things,
right?
There's DNS prefetch.
GARY: Yeah.
MARTIN: There's preload, I believe, as well.
And then there's script defer, script async.
Are we using any of that?
GARY: Sure.
MARTIN: Nice.
GARY: I don't know what we are using from those things.
I don't think we are using much because we don't need to.
It's very helpful if you have crappy
internet to do DNS prefetching, for example.
In our case, we don't need to because we
can talk very fast to all the cascading DNS servers,
for example, to resolve whatever, or preconnect.
Like, why would we preconnect?
We are not following links, for example.
And even for rendering, the fetching of resources
is not synchronous.
MARTIN: True, yeah, because we're doing batch stuff.
GARY: Yeah, and we don't refetch the resources
necessary for a page all the time.
Basically, we are caching on our side.
Yeah, we are caching on our side the resources
to save some bandwidth and host load and whatnot for the site
itself.
Same with preload-- if we are not synchronous,
then we don't particularly need to listen and look at preload.
MARTIN: True.
GARY: These are very useful for browsers.
I was super, super, super-excited about it when this
came out in the late 2000s, I think,
because it was so easy to see how much it helps.
You just dropped one of these tags or keywords
in a link element, and it sped up things so much
because you were on an internet that was not necessarily great.
You had to connect from your location
to servers that were thousands of kilometers or miles away.
And all these little things like preconnect and, I don't know,
DNS prefetch and preload--
or prefetch, they were doing stuff in the background
that you didn't have to do anymore.
MARTIN: Yeah.
GARY: So yeah, I remember at one point,
Google introduced this link.
I think it was preload for the first search result.
MARTIN: Yes.
GARY: Or something like that, or first two or first three,
or whatever-- something like that.
And when I noticed it, in my brain-- again,
this was before I joined Google.
In my brain, that was nothing short of magic
because it loaded the search result page.
And I clicked the first result because I'm a sheep
and I do what other people are also doing.
I clicked the first search result.
And like that, just it was on my screen immediately.
And to me, that was mind-blowing.
So for browsers, it can make a huge difference to use these.
But for search, eh.
MARTIN: Did you know that one of the couple of reasons
that we had this AMP cache was the preload thing?
Because preload has a few problems.
And that's why it was deactivated.
I'm not sure if it's back.
But I think it was deactivated for a while in browsers,
because with preload, the problem is, you're effectively
triggering an action that normally a user would.
And then you're giving cookies and stuff,
so people could infer, they have seen me
in search results or somewhere else.
And that was problematic.
GARY: Of course.
MARTIN: And you could avoid that by having
the AMP cache in-between, because then the AMP cache would download
things from the server without cookies and without being
able to trace it back to a user.
That's one of the things where I'm like, oh,
the AMP cache makes sense.
But then the discussion was so heated
that people had other issues with it,
and it had a lot of issues.
So I think that's fair.
Yeah, so you would say these link rel
prefetch and stuff is not useful from an SEO perspective.
But it is very, very useful for users still.
GARY: I mean, it depends how far do you want to go with SEO.
Because there are plenty of studies
out there-- independent studies even,
that show that people do appreciate quite a bit
when things load fast.
MARTIN: Of course, yeah.
GARY: And they convert better.
I don't remember what the studies say,
but I remember that they convert better.
MARTIN: Retention is higher.
GARY: Yeah, retention is higher.
So if SEO is just about search engine optimization, and just
the technical part of it, then these link hints or link
keywords don't really matter.
If you step beyond the technical SEO
and you also start looking at, once the user is on my site--
or on the site that I manage, how can I retain them,
how can I convert them better?
Then they can become quite useful.
MARTIN: Yeah, but it's tricky to measure that.
GARY: Sure.
MARTIN: That's why not many people are paying attention
to it.
So I'm happy that we're calling this out.
And I think in general, it makes sense as an SEO,
especially if you're on the technical side,
to understand what valid markup should look like.
And if a deviation from the specification is OK,
or if it's a deviation that is potentially problematic.
GARY: So would you agree that, for example, meta tags and link
tags belong in the head?
MARTIN: I would agree, yes.
GARY: When they provide hints for search engines at least?
MARTIN: Yeah, I would say so.
GARY: OK.
MARTIN: Especially because you can
assume that something that is in the body
was probably not put there deliberately,
or at least not in good intentions, because sometimes we
have this problem with mixed signals,
especially when JavaScript is involved.
If you have a canonical that is there at the first time,
we fetch the HTML from the server,
and then the JavaScript changes it.
We actually advise against doing that, changing something
with JavaScript, because then it's like,
what is the intention here?
Was the other one kind of accidental?
GARY: Yeah.
MARTIN: Was the other one the right one,
and now accidentally they changed?
GARY: Yeah.
MARTIN: I understand that there are situations where,
for whatever technical reason, you
can't have them in the initial HTML,
then add them with JavaScript.
Fine.
But these mixed signals are difficult and tricky
to understand the intention.
So giving as clear intention as possible, I think,
is generally the course of action.
And I believe that the metadata, then,
should also sit in the head to be very, very explicit.
This is our intention.
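Editor's sketch (not from the episode): one way a site owner might check for the mixed signal Martin describes is to compare the canonical in the server's initial HTML with the one present after JavaScript has run. The Python below assumes both versions are already available as strings. How the rendered HTML is obtained is out of scope, the function names and result labels are made up, and nothing here describes how a search engine actually processes these signals.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect href values of every <link rel="canonical">."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel_keywords = (a.get("rel") or "").lower().split()
        if "canonical" in rel_keywords and a.get("href"):
            self.canonicals.append(a["href"])


def find_canonicals(html: str) -> list[str]:
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonicals


def canonical_signal(raw_html: str, rendered_html: str) -> str:
    """Compare canonicals before and after rendering (hypothetical labels)."""
    raw = set(find_canonicals(raw_html))
    rendered = set(find_canonicals(rendered_html))
    if not raw and not rendered:
        return "none"
    if raw == rendered:
        return "consistent"
    if not raw:
        return "added-by-rendering"  # only JavaScript supplied a canonical
    return "mixed"                   # initial HTML and rendered DOM disagree


if __name__ == "__main__":
    raw = '<head><link rel="canonical" href="https://example.com/a"></head>'
    rendered = '<head><link rel="canonical" href="https://example.com/b"></head>'
    print(canonical_signal(raw, raw))
    print(canonical_signal(raw, rendered))
    print(canonical_signal("<head></head>", rendered))
```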
GARY: Yeah.
MARTIN: OK.
Cool, I think that made sense, which is surprising.
OK, we talked about parsing.
We talked about hints in the metadata.
We talked about metadata in general.
I think that caught us up on the topic.
We finally discussed this in the podcast.
GARY: I mean, you still have the body.
But I think the body itself is kind of boring.
MARTIN: Yeah.
That's just the content.
GARY: Right, but there's no--
I don't see how there are gotchas there.
There's stylistic choices that you can make.
And I'm talking about the source, not like what you see.
MARTIN: Yeah, not what you see.
GARY: There are stylistic choices that you can make.
For example, internally, I'm really
fussy about breaking lines close to 80 characters,
because then it's easier to review stuff.
MARTIN: Yeah, it's easier to review on your Commodore 64.
GARY: Sure.
Have you seen my setup?
MARTIN: Yeah.
It's a nice setup.
I like the vintage anyway.
GARY: But for majority of the programming languages that we
use at Google, one of the big ones is C++.
And I wrote a lot of C++ at Google, or C and C++.
And for that, you have to break the line at 80.
So everything needs to fit in 80.
MARTIN: That's the style guide, yeah.
GARY: That's the style guide, yes.
And most of our review apps or software that we have,
they will tailor for that, for those 80 characters.
So the review platform that we are using
is going to do really weird things when something
runs more than 80 characters.
And then if you are reviewing big documents,
all those little weird line breaks
is going to be really weird to review.
So yeah, I'm breaking at 80 characters as much as possible,
even HTML.
But other than that, I cannot think of other things that you--
MARTIN: I actually have a question for the body.
GARY: All right.
MARTIN: What's your stance on semantic markup?
So are you expecting a difference between me just
having a paragraph element, and then some text with links
and images, and another paragraph and another paragraph,
and me kind of using headlines randomly?
Or there's an HTML5 algorithm or structure,
like the standard says, oh, you should do this with one H1.
And then you can use article and section elements
on a page to give more semantic meaning, add header and footer
and nav and all this kind of stuff?
Does that make a difference from a search engine perspective?
GARY: I don't think so, unless you do something really weird.
MARTIN: OK. I think it helps, as in for the users
and for the browsers.
But I don't think it helps a search engine that much as well.
Yeah.
GARY: Oh, you asked me about search engines.
MARTIN: Yeah.
So search engines, do you think it's
a small difference in practice?
GARY: I think so, because you can say that something is valid.
That's a binary thing.
It's very hard to say that something is close to valid.
And then what do you do there when something
is just close to valid?
For example-- and this doesn't exist,
so don't try to come up with conspiracies.
But you cannot give a ranking boost to valid HTML for example.
MARTIN: True, true.
GARY: Because for example, if I miss a closing span, then
suddenly, my HTML is not valid.
It will not change anything for the user.
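Editor's sketch (not from the episode): Gary's missing closing span example, illustrated in Python. A tolerant tokenizer extracts the same visible text from both versions of the markup, while a naive open-tag stack can still flag the unclosed element. The class and function names are made up, and the balance check is deliberately crude. It is not a validator, and it would also flag end tags that HTML legitimately allows authors to omit, such as a closing p or li.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so they are not tracked.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})


class TextAndBalance(HTMLParser):
    """Collect text content and note start tags that were never closed."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.open_tags = []  # stack of currently open elements
        self.unclosed = []   # elements closed implicitly by an outer end tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: ignore it, much as browsers do
        # Pop back to the matching start tag; anything skipped was unclosed.
        while True:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.unclosed.append(top)

    def handle_data(self, data):
        self.text_parts.append(data)


def inspect(html: str) -> tuple[str, list[str]]:
    """Return (text content, unclosed tag names) for a fragment."""
    parser = TextAndBalance()
    parser.feed(html)
    parser.close()
    # Tags still open at the end of input are unclosed as well.
    return "".join(parser.text_parts), parser.unclosed + parser.open_tags


if __name__ == "__main__":
    print(inspect("<p>Hello <span>world</span></p>"))
    print(inspect("<p>Hello <span>world</p>"))
```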
MARTIN: True.
Interesting.
But that's good to know, and that's something that I think
comes up every now and then.
It's like, oh, we should use only one H1 element,
and then H2 for all the different sections, versus just
use H1's for all my sections.
I think that's generally fine, especially because visually,
you can still do something with it,
if you don't care too much about the structure semantically.
OK, cool, that was an interesting conversation.
Thank you so much, Gary, for talking about parsing with me.
That was wild.
GARY: Yeah.
MARTIN: I liked it.
That was good.
So we can take away a few things that I didn't know or wasn't
sure about beforehand.
GARY: OK.
MARTIN: Like metadata in the body,
for instance, not necessarily a great idea as we discussed,
HTML validity, not as important as we
developers like to think sometimes.
GARY: What else?
MARTIN: Semantic markup-- not that important.
Useful for accessibility and users,
but not that important for search engines, at least.
And I think performance and performance improvements
for users do have secondary effects on SEO,
but not necessarily primary effects, because the way that we
as a search engine are using the documents
is different from how browsers for users are using them.
GARY: Indeed.
MARTIN: Yeah, I think that was really interesting.
And I think those are a few really good takeaways.
We can ask the audience.
Feel free to comment on this podcast
and reach out on social media to me,
because Gary doesn't like to be talked to, I hear.
GARY: Yeah.
MARTIN: It would be interesting to hear
if you would like more of this kind of stuff
or if this is too nerdy.
GARY: Yeah, and I think one of the problems is that Martin
and I probably can talk about this for seven more hours
because it is a wild topic.
And it is quirky to say the least.
And there's lots of facets that we can explore.
So if you have questions, just yell at Martin or John Mueller.
MARTIN: Yes, please.
GARY: Leave me out of the yelling.
Thank you.
MARTIN: Leave us comments below this episode
on the podcast platform that you are most happy with.
And we look forward to hearing if this
is something that you all are interested in,
or if this is a nerdy echo chamber.
Anyway, thank you all so much for listening.
And thanks a lot to Gary for being here with me today.
Thank you.
GARY: Yeah.
[LAUGHTER]
MARTIN: I wish you all a fantastic day.
Take care, and talk to you next time.
Goodbye.
GARY: Goodbye.
[MUSIC PLAYING]
MARTIN: We've been having fun with these podcast episodes.
I hope you, the listener, have found
them both entertaining and insightful too.
Feel free to drop us a note on LinkedIn
or chat with us at one of the next events we go to.
If you have any thoughts, let us know.
And of course, do not forget to like and subscribe.
Thank you so much for listening, and goodbye.
[MUSIC PLAYING]