How are web standards made?

2025-04-17 · en automatic
[Music]
Hello and welcome back. It's springtime
and we are back with a new episode for
Search Off the Record, a podcast coming
to you from the Google Search team where
we talk about search and maybe have some
fun along the way as well. My name is
Martin and I am a what am I? Uh
developer. No, not a developer. Search
relations engineer I think is the
official
title. Someone excited about I don't
know. But I'm not alone in my confusion
I guess. Uh with me today is Gary. Hi
Gary. Maybe you're a unicorn. No, I'm
not sure. You that could be your title.
You're a house elf, right?
Yeah, I had many titles. I remember
titles
um maybe 10 years ago because I'm such a
cheerful person, Maile, uh, Maile Ohye, who
was, uh, working with us back
then, uh, she gave me a title: Chief of
Sunshine and Happiness. Oh. And I was
very happy about that because it's
super ironic.
No, that's very accurate, I guess. No,
it's very accurate. You're such a joy.
Martin, stop lying live. Oh, wait. This
is not live. Go on. It's off the record.
We can say whatever.
Oh, god. Anyway, um, so, uh, that was
my, uh, given title. And then the house
elf, I don't know where that came from.
I think it was someone asked me
uh what my title is like someone
external and then I was just like
fumbling like you did with like whatever
our title is and we don't know and then
I was just like, you know what,
house elf, and then it just
stuck. I think I gave myself internally
we had this tool where you can look up
people working at Google. Uh, I think
my title that I gave myself there is
open web cheerleader. So that's a
fun one. Hey, by the way, open web. Oh,
yeah. Yeah. Remember Steve? I thought it
was cool. What? Remember Steve? Our
search engine?
The my my butler.
What? No, the search engine that we
built, you know, the toy search engine
that we that we built and uh haven't
been speaking about for a while now. Oh
my god, that was
like that was like 20 years ago. Yeah,
it feels like it. But it's, you know, uh
I was wondering like Steve had a cool
feature um that I think other other
people should use more. Can we make that
into like an internet standard? It's
already a meme, I believe, but we could
make it a standard. No. Should we do
that? Why?
You you've been at a standards meeting
recently. I know that you've done that
and I know that you have worked with
someone to make robots.txt a
standard. Wouldn't it be cool if we made
more web standards, like more... Well, I'm
not even sure if it's a web standard. Is
it a web standard? Is robots.txt a web
standard? I don't know. I'm confused.
Um
so I I I think we have to go back to
what a standard is
first. Um and then we can qualify them
with internet standard or web standard
or whatever. Oh okay. Um
but I think it largely depends on under
what organization you are standardizing
something. Um, I work with the IETF, the
Internet Engineering Task Force. Within
that, a couple of working groups, uh,
namely AIPREF and, uh, most recently the,
the
TLS... Like, what does that stand
for? Uh, Transport Layer Security working
group. Oh, okay.
Um and um
yeah, let's like if you want to go and
explore this then we should probably
look up what a standard is, right? Okay.
So I think a standard is kind of like a
an
agreement amongst a bunch
of players in a certain field. Let's say
like HTML is a standard because a bunch
of groups have agreed that HTML has
certain elements and is built in a
certain way and what are the things that
it has and hasn't. And I think for HTML
it used to be the W3C, the, uh,
World Wide Web Consortium, um, but it has
recently, well, not recently, a couple
of years back, moved to a living standard
under the
WHATWG. Yeah. Um, so I think like a bunch
of people come together form like a
forum or a group where they agree on
certain things to be true or to to be
part of something I guess. I don't know.
So in in my head
in my head
um it's
uh typically a document um and drafted
by consensus. So basically someone
proposes something to a group and then
the group agrees
uh that it's a good idea and we should
continue with it. Um and that's the
approval state.
Um, and typically it has to be under
some sort of institution. Mhm. Or
consortium or something. Um, like
some governing entity. I I suppose
um so yeah basically what you said but
using fancier words. Um
and we h we have quite a
few such entities slash consortiums
um that govern standards that we are
using on the internet um namely like the
thing that I'm working with is the IETF
but there's also the W3C for example
um that creates standards um
and I
think that they typically agree upon
what they are governing. So for example,
the IETF is
governing, well, internet-related stuff, um,
that is lower in the stack, uh, in the
internet stack. So, for example,
um, uh, transfer protocols like, uh, QUIC
or TCP/IP or that kind of stuff, HTTP
itself,
um and
then I could be completely wrong but
because I haven't worked with them but
W3C is more related
to the markup perhaps. Yeah that is used
on the internet.
And then we have JavaScript that I don't
even know where it falls. Um but that
could be like a completely different
consortium because for example C has its
own thing like its own entity and own
governing body. So I would imagine that
JavaScript could have the same thing. It
has the TC39, the Technical Committee 39,
but I'm not sure which organization that
is under, because it's clearly part of a
bigger thing. But I'll figure that one
out eventually. Eventually, right? So,
what are we standardizing? Um, so I
thought like we had a pretty cool thing
where Steve could be told what
cool things you have on your website, um,
by basically having like a
cool.txt. So that's not really
markup. It's like robots.txt but better.
What, better? Nothing is better than
robots.txt, Martin. No. So it sounds like
a site map. Oh. Oh. So maybe we already
have that.
Yeah. Huh. Where does the sitemap
standard live? I mean we could try to
standardize sitemap. Isn't it
standardized? It isn't. Um, stepping
back, like, robots.txt was a de facto
standard for, what, like 20 years or
something? Mhm. Or 25 years, something,
something like that. Um, and then
eventually we standardized it
under the IETF. Why the IETF? Um, because
that's what I was familiar with. Um like
no particular reason. Um
I'm a nerd and, uh, I read RFCs, Requests
for Comments, and, uh, basically internet
standards um to understand like how the
internet works and what are the new
things that uh happen on the internet
like new protocols that we can use for
stuff. Um so basically we just went with
the IETF.
Okay.
Um, so similarly, sitemap, which was
created,
um, I mean, the standard was created, well,
the de facto standard was created, um, around
the year 2005, 2006, or it was announced
in 2006 but was created in 2005,
something like that, um,
that is a de facto standard, meaning that
there's,
um,
no governing body that adopted it, um, and
pushed for it being an actual standard.
Okay. But it's the de facto standard
because everyone kind of agreed that we
are going to do it like that.
Yeah. Okay. Okay. So basically it's so
widely adopted in its original form
um
that basically it became a de facto
standard. I don't know what a better
word is for de facto but um it's it's
basically something that people just
adopted. Oh, like writing is not a
standard but what it's an informal
standard, because once you have a document
and, like, a larger organization adopts
it, it's a formal standard.
Yeah, perfect. Informal standard.
Then we don't use Latin anymore.
Informal. Fine. Um, if I had to guess
where we would go with uh or if we
wanted to standardize it, then I would
guess
whoever governs RSS and Atom and
those things, because technically it's
doing the same thing as those
formats: basically giving you a list of
URLs plus, optionally, some extra
metadata. And probably that's why we are
also using, we, Google Search, are also
using RSS and Atom as, uh, discovery
sources. Confusingly enough, they have an
RSS Advisory
Board. The RSS Advisory Board owns RSS
and that's a separate standards body
apparently. Wow. Who owns XML? I think
XML is uh What did you search for? Atom
XML.
I searched for RSS, and RSS is owned
by the RSS Advisory Board. Huh. Ah, Atom
as well? Atom, I don't know. Let me see.
Actually, Atom is IETF. How about that?
Oh, and I found out who owns JavaScript
and I feel very very dumb. Okay. Anyway,
so we should start, um, well, we should
stop googling, I guess, um, and, uh, good,
good podcasting, like, what both of us
just... I think these, these guys, this is
the real, this is the real vibe of this
podcast, really, um, figuring things out as
we go along. JavaScript, called ECMAScript,
it's owned by the ECMA, the European
Computer Manufacturers Association, which
is very confusing to me. Why does the
European Computer Manufacturers
Association...?
I didn't even know that we have computer
manufacturers in Europe, but anyway, uh
used to at least. So, okay. So, this is
kind of weird. So, basically, Atom is
owned by the IETF. Well, developed by the IETF. Um,
and RSS
is
[Music]
not, which
means we can actually choose whatever we
want. Like, if we wanted to standardize
sitemaps,
then technically we could go with
the IETF,
um, or try to go with the IETF, um, and, uh,
see if they would, uh, want to adopt it.
And that's because they already have
something that is similar to sitemaps. M
okay.
So it kind of fits in the in the big
picture. And is that how you would
choose a standards body? Because if you
want to make something a standard, you,
apparently you have lots of
bodies, uh, to choose from: the RSS
Advisory Board, the IETF, the IEEE,
probably the ECMA, apparently,
um, the W3C, the
WHATWG. What does it, like, what does it
mean? Why should I go with one body over
the other?
So you
would probably go with one body or the
other
um because it fits their purpose better.
So for example, with, I guess, sitemap
XML is not that great an example, because there you
can choose, like,
multiple bodies. Um, but if you say,
uh you wanted to create a new
uh or a replacement for TCP Mhm.
then there's only one place for it where
you would do that where you would
develop it and that's the
IETF. Um, and that's
because, the way I understand it,
that the people who would
know
enough about previous standards related
standards are there. So basically you
have a community that is expert on the
topic and
then it's more likely that you are going
to end up with something that is
actually usable on the internet because
the discussions will lead to a better
standard. Ah okay. So whoever is best
positioned to help you make the right
decisions in the standard and get it
adopted as widely as possible that's
where you should take it. Okay. Got it.
Yeah, that's that that's the way I
understand it. Um, makes sense. Which
might be completely off, but um in in my
Yeah, in my brain it makes sense.
Um that that does make sense. Yeah. And
then once once you picked a standards
body, then what?
Yeah, that's a good question. In the case of
the IETF,
there's, there's a...
I think there are two ways. Uh, well,
actually three ways, I guess. So at the IETF
you can publish something like an
informational RFC, uh, which
doesn't become like a standard. It's
basically an opinion piece if you like.
Um and pretty much anyone can publish
that I I think. Um
and there's minimal
scrutiny with approving them. Um and
then you have
uh if you want to actually go for a
standards track um document or end
result then you have two options. One
is um finding a working group within the
IETF that would adopt
um or babysit your draft uh that you are
writing. Meaning
that you find the group that has the
most expertise for the stuff that you
are
writing. So if you
are I don't
know, when you were developing, or when we
were developing, QUIC, uh, Q-U-I-C,
um, which
is like
a kind of a replacement for TCP/IP, not
quite, but now with
HTTP/3 it's actually getting used quite a
bit,
um, you would look for the working group
that developed TCP/IP, which is a very
low-numbered standard, probably not one but
somewhere there. Um, because it is the
very basic building block of the
internet. Um, and then you would go to
that working group and you would ask
them, hey, what do you think about this
new idea that I have? Uh, here's a draft
that I wrote up. Um
um would uh would you see this as a good
fit for your working
group? And then if they say yes, then
you knock yourself out and start
developing further with input from the
working group. um they will raise lots
of concerns. Um they will have very good
feedback about pretty much every letter
in the in your draft and then eventually
after probably years of um uh iteration
you end up with something that you can
get
um consensus on. meaning
that everyone agrees that this is a good
standard uh proposal and it should
become a standard. Okay. So if you can't
find a working group or you don't know
about working groups in general at the IETF,
then they have a dispatch
list, and you can email that dispatch list
with your
idea, including a link to your draft,
and then people will start arguing about
which working group this should belong
to, um, or whether it should be on an
independent track.
That sounds interesting, but it sounds
pretty low
barrier. How many people are are doing
that? Just like emailing their ideas. Is
is that happening often?
Um I don't know. I don't I don't follow
the dispatch list. Um, I'm following
HTTP, TLS, and AIPREF. Um, but, um, I would
imagine not so
many
and I think there's a good reason for
that and that is
that technically the internet is in
pretty good shape when it comes to the
technologies that that make the internet
happen. So it there's not that much need
for for new stuff and replacement stuff
and
and yeah it just doesn't happen. And
then when there is something new
then people would directly go to the
respective working groups instead of
dispatching
um their
draft because whether we like it or not
the the community that develops the
internet is pretty tight. So they they
kind of know where to go when um when
they have a new idea. It's not
like Gary woke up one day and hey, I
want a
new AI
nonsense.txt
and I will make it happen on my
own. They just know where to go. So you
go there, you you find a working group,
you build your request for comments
thing and then you get like lots of
feedback and at which point is it that
this becomes a
standard?
Oh
well, it probably takes years
for something to become a standard.
And when I say probably,
I'm underselling it. So
basically it just takes years for
something to become a standard.
Um there's a good reason for that. Um
basically standards shouldn't be
issued
lightly because they are going to govern
something. There's lots of things to pay
attention to when you are developing
especially internet standards
because there are for example bad actors
on the internet who are going to try to
exploit the stuff that um you are
developing let's say
that, I don't know, robots.txt,
um, there's a risk that someone could, uh,
create a buffer overflow in our parsers,
for example, and then exploit that to
their advantage somehow. Like, it
cannot happen because it happens in
isolation, or rather the parsing is
done in isolation, but, uh, let's say that
if we hadn't put a 500 kilobyte limit on the
robots.txt
file, then people would be able to try to
cause a buffer overflow,
like try a 4 gigabyte file
to see if that would cause
damage to the parser, namely a buffer
overflow, or try with 64 gigs or
whatever. Um, and then once you have
that buffer overflow, then you have
access to memory blocks that,
um, you could exploit to your advantage.
Um and these are things that when we are
developing the standards we pay
attention to. So basically when I'm
reading a draft then I would look at
um how I would exploit
stuff that the standard is
describing and then make recommendations
to the draft or to the authors saying
that hey I think like this particular
bit could be exploited.
How
about you add a 500 kilobyte
parsing limit? Because then
basically the plan of
the attackers, or potential attackers, is
foiled from the beginning. It's
like, with 500 kilobytes you're not going
to
exhaust memory. Mhm. Or if you do then
you have a different problem.
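
A minimal Python sketch of the size cap described above. The function name read_capped, the chunk size, and the reading of "500 kilobytes" as 500 x 1024 bytes are assumptions made for illustration; none of them come from a real parser. In Python the practical risk is memory exhaustion rather than a classic buffer overflow, but the defensive idea is the same: stop reading at a fixed limit no matter how large the file claims to be.

```python
import io

# Assumed interpretation of the "500 kilobyte" figure mentioned in the episode.
MAX_BYTES = 500 * 1024


def read_capped(stream, limit=MAX_BYTES, chunk_size=64 * 1024):
    """Read at most `limit` bytes from a binary stream and ignore the rest."""
    chunks = []
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # end of stream reached before the limit
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Hypothetical oversized input: only the first MAX_BYTES bytes are kept.
oversized = io.BytesIO(b"a" * (MAX_BYTES + 1000))
body = read_capped(oversized)
```

With a cap like this, a 4 gigabyte or 64 gigabyte file costs the parser no more memory than a 500 kilobyte one.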
To me, often it feels like people are
nitpicking on stuff, but they are
nitpicking for a very good reason, and
that is that these standards have to
work everywhere for a long time um
without
fault or as little fault as possible.
Yeah, they are going to nitpick every
single little thing uh in the draft and
make recommendations about how to
improve it. And that can also
go like really weird because sometimes
it's nitpicking about the language
that's used in the draft. Oh, okay. So,
for
example,
you haven't explained clearly enough how
parsing should be done when there's an
empty line between two rules, for
example, in robots.txt. And then you
would go back and refine your draft um
and add more words to a sentence to
better explain
um uh that kind of stuff. I think in our
case we were very lucky because we had
our tech writer with us, um, and, uh, Lizzi
is really good at noticing,
um, deficiencies in the language that
we are using. Um, like when you or I
write something, um, when we
spit out our, uh, our drafts, on first read
it's like, the English is very weird.
Mhm. Sometimes. Because English is not our
first language. Yeah.
Yeah. And then, uh, Lizzi would come in
and she would clean it up and ask
questions about like this is what you
meant or this is what you meant because
depending on where you put, I don't know,
the comma, it might mean different things.
Mhm. Um so yeah that that's that's one
thing and the other thing is that
especially in IETF, uh, in IETF standards, we
use, um, certain keywords that, uh, have
weight, um, and that would be stuff like, uh,
SHALL NOT, SHALL, or MAY, or
MUST. And if you're reading RFCs,
those are capitalized, and that's because
those are special-meaning keywords.
Well, not special meaning, but, um, they
have special weight in the
documentation. And when you're reading,
well, not documentation, in the standard,
and when you're reading it as an
implementer, then you have
to understand that if something says
must, then you actually have to do it.
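
As a rough illustration of why the capitalization matters, here is a small Python sketch that pulls the requirement keywords out of a piece of draft text. The keyword list follows BCP 14 (RFC 2119, as updated by RFC 8174); the function name find_requirement_keywords and the sample sentence are hypothetical and exist only for this example.

```python
import re

# BCP 14 requirement keywords (RFC 2119, updated by RFC 8174).
# Longer phrases come first so "MUST NOT" is not matched as just "MUST".
KEYWORDS = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT", "NOT RECOMMENDED",
    "MUST", "REQUIRED", "SHALL", "SHOULD", "RECOMMENDED", "MAY", "OPTIONAL",
]
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")


def find_requirement_keywords(text):
    """Return the capitalized requirement keywords in the order they appear."""
    return PATTERN.findall(text)


# Hypothetical draft sentence; the lowercase "must" carries no special weight.
sample = "A parser MUST NOT crash; it SHOULD log errors; you must be patient."
found = find_requirement_keywords(sample)
```

The match is deliberately case-sensitive, since only the uppercase forms carry the special weight described here.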
So let's say
that in the case of robots.txt you have rows
and then the
rows contain a key value pair right and
then in the draft it would say that the
key must be separated from the value
using a
colon. And then when you are writing
your parser then you know that there's
no wiggle room
there. It must be, uh, it has to be a
colon that is the separator.
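
A minimal Python sketch of what "no wiggle room" means for a parser author. It only illustrates the colon rule discussed here and is not the full robots.txt grammar; the function name parse_rule_line and the choice to lowercase the key are assumptions of this sketch.

```python
def parse_rule_line(line):
    """Split a 'key: value' row on its first colon.

    Returns a (key, value) tuple, or None when the row has no colon
    and therefore cannot be a valid key-value pair.
    """
    key, separator, value = line.partition(":")
    if not separator:
        return None  # the colon is mandatory, so nothing is guessed here
    # Lowercasing the key is a convenience assumed for this sketch.
    return key.strip().lower(), value.strip()


# Hypothetical rows: the first two parse, the third is rejected.
rows = ["Disallow: /private", "Sitemap: https://example.com/sitemap.xml", "Disallow /oops"]
parsed = [parse_rule_line(row) for row in rows]
```

Splitting on the first colon only keeps values that themselves contain colons, such as URLs, intact.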
But then, for example, with a
parsing limit, we do want to allow
some wiggle room. For example, if you
are absolutely certain
that you cannot get a buffer overflow, uh,
from, uh, too large, um, robots.txt, or very
large robots.txt files, then you could
say that, um, parsers,
uh, should parse the first 500
kilobytes of robots.txt,
and then you know that, okay, it's not
a hard
limit, it's basically like a lower limit
of how much I have to parse, and then if
I want to parse 700 gigs' worth of
robots.txt file, I can do it. Like,
the standard sets no hard limit on that.
So yeah, those keywords
mean a lot.
Okay. Um, I don't actually know where I
was going with this. It takes a long
time for a standard to get made because
they have to be long-lived and eventually
all these comments are addressed and
everyone in the working group is happy
with it. So, does it automatically
become a
standard? Okay. So, you can push back
on comments, like, hey, I took your
comment into consideration. Here are my
reasons for using the current language
or the current structure or whatever.
Um, and I think your comment, um,
should not be applied to my
draft. And then you can go back and
forth and convince the other person to
basically accept whatever you already
have. Um or you address the comment.
Yeah. And
then basically implement the
change that you were asked to implement.
And then once there are no more, uh,
comments from the working group, then
there's something called a last call.
Mhm.
Meaning that,
uh, and I actually monitor that list
the most
rigorously, I guess, um,
basically the working group believes,
or
the shepherd of the document believes,
that, uh, all comments were addressed and
there haven't been new
comments for a while. So if you have
something to say, say it now or be
silent forever because we are going
ahead with standardizing this. Um
another interesting thing is that there's a
bunch of different directorates in
the IETF, um, that will need to
review,
um, uh, a document or a draft during
the last call. Um, so for
example the structure of the RFC
actually matters or the draft matters
because
um it may be used to or the the drafts
may be parsed by machines and then um it
needs to follow like some some very
specific structure or even just the
publishing engine that they are using
that will need to be able to parse it.
References, for example: it matters where
you put them, um, it matters
what kind of reference you are using.
There's informative, where you are just
like informing that, hey, this, I don't
know, like from robots.txt we
reference sitemap XML as an informative
reference, um, but then there are
also normative references. So for example,
when we are talking about HTTP headers,
we call in, uh, a normative reference
to the HTTP standard, uh, like RFC
9110, because we make claims about how
something should be parsed, for example,
in the HTTP header, or how something
should be interpreted in the HTTP header.
And then a normative reference is basically
a link to the doc that defines
some behavior.
let's
say and then someone extremely familiar
with the topic that you're discussing in
your draft is going to be asked to
review your draft. Mhm. To see that no
one actually missed anything that can be
missed. So basically you get like a
final review, and then once everyone is
happy, like every directorate, um, uh,
is happy with your draft and there are no
more last call comments, then it can be
moved forward with standardization. And
then there are at least two kinds of
standards, um, that I can think
of, at the IETF at least, and one is proposed
standard, uh, which basically is a
standard, but it's not immutable.
Ah meaning that technically it could be
changed or that's how I interpret it.
And then there are actual standards, like
internet standards, like STD 1, standard
one, um, which defines, uh, an
immutable
standard that the internet uses. Okay.
Like, you can add extensions to
TCP, but technically it's immutable, like
the way it is right now, that's how
it has to die. I think that's that's
reasonable then you can at least rely on
that and if you can still extend it
without changing the underlying
fundamental standard I think that's fair
right yeah there might be limitations
but hey but that's it okay so
then everyone agrees or you have uh
explained why something got excluded and
then it becomes uh a a standard by
basically being reviewed a final time by
the directorates and then there's a last
call period and then then we have a
standard ta. Okay, that's pretty cool.
You you said that takes years. Yeah. Is
that is that because the the consensus
takes so long or do you need to have
like a reference implementation or
something or how does that how how come
that it takes so long? I think both.
Um I think it's both.
So you have to show that the thing that
you are working on actually
works. Um and for that usually when we
are in a like in the TLS working group
you would have um adoption calls for for
new drafts if they don't have someone to
work with already. Like let's say that
uh Martin came up with this new
brilliant idea and uh needs
someone to implement it as a
test to show that it actually
works like have a proof of concept I
guess like you need to show that it
works. Um, and then the other thing is
that like there's especially
with certain drafts, there's lots of
back and forth on the mailing list
about particular sections of of
something or even like the the the the
general topic that you are discussing in
your draft. And
then, "argument" is not the right word, like,
basically it's just like civil and
constructive discussion about the the
draft or sections of the draft or
multiple sections of the draft. Um, and
you know the internet like people have
opinions about stuff and then you have
to decide
uh whether you address the comments.
Um, and then you are able to move
forward with your
draft or take a step back and
maybe revise that
whole paragraph or the whole draft to
exclude the part that people are upset
about, okay, or nitpicking on. Um, so
basically there's tons of iteration
going on. Um
and
it makes the process very slow. Mhm. But
for a good reason like as we said that
like these standards actually are used
by sometimes the whole internet um like
in case of TCP for example like the
whole internet is using it. So it has to
be, uh, ironclad. Mhm. There's no
wiggle room there. Yeah. I mean, there's
the saying if you want to go fast go
alone if you want to go far go as a
group and I think this is one of the
examples where going slower improves the
quality and longevity of
uh, the outcome. And all of this is public,
right? It's not happening behind closed
doors? Okay. No, uh, everything is public, um,
and also our meetings are are public um
so technically anyone can join in and uh
listen to what we are talking about um
or
even just say words in the meeting um
like there's no formal
membership. You just you can just show
up and contribute to to standards.
That's really cool. I don't know how
it works with other standards
bodies, but at least with the IETF, you can
just show up and say what you have to
say in in our meetings. Um, for formal
meetings there's uh usually an entrance
fee. Um, but
otherwise you can just show up. I mean
the fee is probably to cover the cost of
the location and all the logistics of
making it happen. Yeah, that's fair.
That's fair. I mean, like, the last
IETF meeting I went to, that was in
Bangkok, it was a week long including
the hackathon. Um, and
they, or we, were using three
floors of the hotel. Mhm.
Um, like literally all the meeting rooms
that the hotel had available.
Um, so, like, it must
cost an enormous amount of money. Yeah.
Um, so they have to recoup it
somehow, because they don't
have a profit. And as far as I know,
that's pretty much the case for most of
the standards bodies because I know that
for
W3C, it's very easy to set up an
account. It's free. Uh, you can start
your own working group and then do your
thing and then eventually you come out
with something that looks like a
proposal and then other people are being
invited to comment on it and then that
whole process happens. but it's all
public as well. I'm pretty sure the TC39,
which governs JavaScript, or ECMAScript,
is doing more or less the same thing, and
I'm pretty sure that the WHATWG does so as
well, and I think they are even on GitHub,
if I'm not mistaken. So all pretty
transparent processes which is pretty
cool I think. Yeah. Yeah. No, that was
interesting. So that's how a standard is
made. That's how the sausage is made
from the inside. Um wow. So okay, if we
had a bunch of years and enough
motivation, we could make, for instance,
sitemaps a standard. That's
interesting. I mean, we could. There's
also like probably you have to sit down
and figure out whether it's worth it
because it's not
really like it's a simple XML file. So
it and and there's not that much that
can go wrong with it. So it's like I I
was thinking about, um, submitting a
proposal for standardizing it, but
then I was thinking, like, but why? Like,
what's the benefit? Because with
robots.txt there was a benefit, um, because
we knew that, um, different
parsers tend to parse robots.txt files
differently, and then if you have a
standard, then at least you fix that, uh,
potentially. With sitemaps it's like, eh.
Yeah. Yeah. If it's not a standard, then
then what? Okay. So, you have to weigh
the benefits. Um, and as you said, like
one of the benefits is that you can kind
of make things more reliable across
different products from different
vendors, I guess. Okay. I mean, with
with those de facto or informal
standards. Yes. So, what are the
benefits? So, why would you do it?
What did you get out of it with
robots.txt? It's that we know for certain
that now we are in a better place when
it comes to parsing robots.txt files than
we were 10 years ago. Um, it also allowed
us to,
um, to open source our robots.txt parser,
and then people started building on it, um,
which also helps with,
um, creating better robots.txt files, I
would
imagine
and, like,
having robots.txt,
at least to me, but I think also
for pretty much every search engine, is a
super important thing. And then if we can
agree on how robots.txt files should be
parsed, then there's
less strain on site owners. Oh, fair.
Like, trying to figure out how to
write the damned
files. Um, so it works for everyone.
I see. Like, every consumer of robots.txt
files. Um, and to me that was like,
that's nice for the community and nice
for the internet itself. Okay, that
makes sense. That was really cool. Thank
you so much for taking me on this
journey of how web standards,
internet standards and all that are
made. I've never been part of, uh,
that kind of work in the IETF. So that's
that's interesting. And I think and
whose fault is that? It's mine, I guess.
It's mine entirely. And I think that's
it for this episode as well. Um if
people want to find out more of this,
then um check out the IETF, check out
the W3C and all the other standards
bodies. They have pretty good websites
that explain how these processes work
and how you can contribute. Um, maybe
check out the dispatch list from the IETF.
There might be interesting things coming
that you are looking to be part of. I
don't know. Um yeah. Anyway, thank you
all folks for listening and uh goodbye.
Bye-bye.
We've been having fun with these podcast
episodes and we hope that you, the
listener, have found them both
entertaining and insightful, too. Feel
free to drop us a note on LinkedIn or
chat with us at one of the next events
that we go to if you have any thoughts.
And of course, don't forget to like and
subscribe. Thank you and goodbye.
[Music]