How are web standards made?
2025-04-17 · en automatic
[Music] Hello and welcome back. It's springtime and we are back with a new episode of Search Off the Record, a podcast coming to you from the Google Search team where we talk about search and maybe have some fun along the way as well. My name is Martin and I am a... what am I? Developer? No, not a developer. Search relations engineer, I think, is the official title. Someone excited about, I don't know. But I'm not alone in my confusion, I guess. With me today is Gary. Hi, Gary. Maybe you're a unicorn. No, I'm not sure. That could be your title. You're a house elf, right? Yeah, I had many titles. I remember titles from maybe 10 years ago. Because I'm such a cheerful person, Miley O., who was working with us back then, gave me the title chief of sunshine and happiness. Oh. And I was very happy about that because it's super ironic. No, that's very accurate, I guess. No, it's very accurate. You're such a joy. Martin, stop lying live. Oh, wait. This is not live. Go on. It's off the record. We can say whatever. Oh, god. Anyway, so that was my given title. And then the house elf, I don't know where that came from. I think someone external asked me what my title is, and I was just fumbling, like you did, with whatever our title is, because we don't know, and then I was just like, you know what, how about house elf? And then it just stuck. Internally we had this tool where you can look up people working at Google, and I think the title that I gave myself there is open web cheerleader. So that's a fun one. By the way, open web. Oh, yeah. Remember Steve? I thought it was cool. What? Remember Steve? Our search engine? My butler? What? No, the search engine that we built, you know, the toy search engine that we built and haven't been speaking about for a while now. Oh my god, that was like 20 years ago. Yeah, it feels like it.
But, you know, I was wondering. Steve had a cool feature that I think other people should use more. Can we make that into an internet standard? It's already a meme, I believe, but we could make it a standard. No. Should we do that? Why? You've been at a standards meeting recently. I know that you've done that, and I know that you have worked with someone to make robots.txt a standard. Wouldn't it be cool if we made more web standards? Well, I'm not even sure if it's a web standard. Is it a web standard? Is robots.txt a web standard? I don't know. I'm confused. So I think we have to go back to what a standard is first, and then we can qualify them with internet standard or web standard or whatever. Oh, okay. But I think it largely depends on under what organization you are standardizing something. I work with the IETF, the Internet Engineering Task Force, and within that a couple of working groups, namely AIPREF most recently, and TLS. What does that stand for? The transport security working group. Oh, okay. And yeah, if you want to go and explore this, then we should probably look up what a standard is, right? Okay. So I think a standard is kind of like an agreement amongst a bunch of players in a certain field. Let's say HTML is a standard because a bunch of groups have agreed that HTML has certain elements and is built in a certain way, and what the things are that it has and hasn't. And I think for HTML it used to be the W3C, the World Wide Web Consortium, but a couple of years back it moved to a living standard under the WHATWG. Yeah. So I think a bunch of people come together and form a forum or a group where they agree on certain things to be true or to be part of something, I guess. I don't know. So in my head, it's typically a document, drafted by consensus.
So basically someone proposes something to a group, and then the group agrees that it's a good idea and we should continue with it, and that's the approval state. And typically it has to be under some sort of institution. Mhm. Or consortium or something, like some governing entity, I suppose. So yeah, basically what you said but using fancier words. And we have quite a few such entities, or consortiums, that govern standards that we are using on the internet. The one that I'm working with is the IETF, but there's also the W3C, for example, that creates standards. And I think that they typically agree upon what they are governing. So for example, the IETF is governing internet-related stuff that is lower in the internet stack: for example, transfer protocols like QUIC or TCP/IP or that kind of stuff, and HTTP itself. And then, I could be completely wrong because I haven't worked with them, but the W3C is more related to the markup, perhaps, that is used on the internet. Yeah. And then we have JavaScript, and I don't even know where it falls. That could be a completely different consortium, because, for example, C has its own thing, its own entity and own governing body, so I would imagine that JavaScript could have the same thing. It has TC39, the Technical Committee 39, but I'm not sure which organization that is under, because it's clearly part of a bigger thing. But I'll figure that one out eventually. Eventually, right? So, what are we standardizing? So I thought we had a pretty cool thing where Steve could be told what cool things you have on your website by basically having a cool.txt. So that's not really markup. It's like robots.txt but better. What, better? Nothing is better than robots.txt, Martin. No. So it sounds like a sitemap. Oh. Oh. So maybe we already have that. Yeah. Huh. Where does the sitemap standard live?
I mean, we could try to standardize sitemap. Isn't it standardized? It isn't. Stepping back, robots.txt was a de facto standard for what, 20 years or something? Mhm. Or 25 years, something like that. And then eventually we standardized it under the IETF. Why the IETF? Because that's what I was familiar with. No particular reason. I'm a nerd and I read RFCs, requests for comments, basically internet standards, to understand how the internet works and what the new things are that happen on the internet, like new protocols that we can use for stuff. So basically we just went with the IETF. Okay. So similarly sitemap, where the de facto standard was created around the year 2005 or 2006, or it was announced in 2006 but was created in 2005, something like that. That is a de facto standard, meaning that there is no governing body that adopted it and pushed for it being an actual standard. Okay. But it's the de facto standard because everyone kind of agreed that we are going to do it like that. Yeah. Okay. So basically it's so widely adopted in its original form that it became a de facto standard. I don't know what a better word is for de facto, but it's basically something that people just adopted. Oh, right, it's not a standard, but it's an informal standard, because once you have a document and a larger organization adopts it, it's a formal standard. Yeah, perfect. Informal standard. Then we don't use Latin anymore. Informal. Fine.
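A note for readers who have not opened a sitemap: under the sitemaps.org protocol it is an XML file with a `urlset` root whose `url` entries each carry a `loc` and, optionally, metadata such as `lastmod`. The Python sketch below reads such a file with the standard library. The function name `list_sitemap_urls` and the sample document are illustrative assumptions, not something from the episode, and the sketch ignores sitemap index files and extensions.

```python
import xml.etree.ElementTree as ET

# Namespace published by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_urls(xml_text):
    """Return (loc, lastmod) pairs from a sitemap document.

    Hypothetical helper for illustration; lastmod is None when absent.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:  # skip entries without a usable URL
            entries.append((loc, lastmod))
    return entries


# Made-up sample document, only to show the shape of the format.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc><lastmod>2025-04-17</lastmod></url>"
    "<url><loc>https://example.com/about</loc></url>"
    "</urlset>"
)

if __name__ == "__main__":
    for loc, lastmod in list_sitemap_urls(SAMPLE):
        print(loc, lastmod)
```

The format really is just a list of URLs with optional metadata per entry, which is the property the conversation keeps returning to.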
If I had to guess where we would go if we wanted to standardize it, then I would guess whoever governs RSS and Atom and those things, because technically it's doing the same thing as those formats: basically giving you a list of URLs plus, optionally, some extra metadata. And probably that's why we, Google Search, are also using RSS and Atom as discovery sources. Confusingly enough, they have the RSS Advisory Board. The RSS Advisory Board owns RSS, and that's a separate standards body, apparently. Wow. Who owns XML? I think XML is... What did you search for? Atom, XML. I searched for RSS, and RSS is owned by the RSS Advisory Board. Huh. Ah, Atom as well? Atom, I don't know. Let me see. Actually, Atom is IETF. How about that? Oh, and I found out who owns JavaScript and I feel very, very dumb. Okay. Anyway, we should stop googling, I guess. Good podcasting, what both of us just did. I think this is the real vibe of this podcast, really: figuring things out as we go along. JavaScript, called ECMAScript, is owned by ECMA, the European Computer Manufacturers Association, which is very confusing to me. Why does the European Computer Manufacturers Association...? I didn't even know that we have computer manufacturers in Europe, but anyway. Used to, at least. So, okay. So this is kind of weird. Basically, Atom is owned by the IETF. Well, developed by the IETF. And RSS is [Music] not, which means we can actually choose whatever we want. If we wanted to standardize sitemaps, then technically we could go with the IETF, or try to go with the IETF, and see if they would want to adopt it. And that's because they already have something that is similar to sitemaps. Mm, okay. So it kind of fits in the big picture. And is that how you would choose a standards body?
Because if you want to make something a standard, apparently you have lots of bodies to choose from: the RSS Advisory Board, the IETF, the IEEE, probably ECMA, apparently, the W3C, the WHATWG. What does it mean? Why should I go with one body over the other? So you would probably go with one body or the other because it fits their purpose better. I guess sitemap XML is not that great an example, because there you can choose multiple bodies. But if you say you wanted to create a replacement for TCP, mhm, then there's only one place where you would develop it, and that's the IETF. And that's because, the way I understand it, the people who would know enough about previous, related standards are there. So basically you have a community that is expert on the topic, and then it's more likely that you are going to end up with something that is actually usable on the internet, because the discussions will lead to a better standard. Ah, okay. So whoever is best positioned to help you make the right decisions in the standard and get it adopted as widely as possible, that's where you should take it. Okay. Got it. Yeah, that's the way I understand it. Makes sense. Which might be completely off, but in my brain it makes sense. That does make sense. Yeah. And then once you've picked a standards body, then what? Yeah, that's a good question. In the case of the IETF, I think there are two ways. Well, actually three ways, I guess. At the IETF you can publish something like an informational RFC, which doesn't become a standard. It's basically an opinion piece, if you like. And pretty much anyone can publish that, I think, and there's minimal scrutiny with approving them.
And then, if you want to actually go for a standards track document or end result, you have two options. One is finding a working group within the IETF that would adopt, or babysit, the draft that you are writing, meaning that you find the group that has the most expertise for the stuff that you are writing. So, I don't know, when we were developing QUIC, which is kind of a replacement for TCP/IP, not quite, but now with HTTP/3 it's actually getting used quite a bit, you would look for the working group that developed TCP/IP, which is a very low-numbered standard, probably not one but somewhere there, because it is the very basic building block of the internet. And then you would go to that working group and you would ask them: hey, what do you think about this new idea that I have? Here's a draft that I wrote up. Would you see this as a good fit for your working group? And then if they say yes, you knock yourself out and start developing further with input from the working group. They will raise lots of concerns. They will have very good feedback about pretty much every letter in your draft, and then eventually, after probably years of iteration, you end up with something that you can get consensus on, meaning that everyone agrees that this is a good standard proposal and it should become a standard. Okay. If you can't find a working group, or you don't know about working groups in general at the IETF, then they have a dispatch list, and you can email that dispatch list with your idea, including a link to your draft, and then people will start arguing about which working group this should belong to, or whether it should be independent track. That sounds interesting, but it sounds pretty low barrier. How many people are doing that, just emailing their ideas? Is that happening often? I don't know.
I don't follow the dispatch list. I'm following HTTP, TLS, and AIPREF. But I would imagine not so many, and I think there's a good reason for that: technically, the internet is in pretty good shape when it comes to the technologies that make the internet happen. So there's not that much need for new stuff and replacement stuff, and yeah, it just doesn't happen. And then when there is something new, people would directly go to the respective working groups instead of dispatching their draft because, whether we like it or not, the community that develops the internet is pretty tight. So they kind of know where to go when they have a new idea. It's not like Gary woke up one day and said, hey, I want a new AI nonsense.txt and I will make it happen on my own. They just know where to go. So you go there, you find a working group, you build your request for comments thing, and then you get lots of feedback. And at which point is it that this becomes a standard? Oh well, it probably takes years for something to become a standard. And when I say probably, I'm underselling it. So basically it just takes years for something to become a standard. There's a good reason for that. Basically, standards shouldn't be issued lightly, because they are going to govern something.
There are lots of things to pay attention to when you are developing standards, especially internet standards, because there are, for example, bad actors on the internet who are going to try to exploit the stuff that you are developing. Let's say, I don't know, robots.txt. There's a risk that someone could create a buffer overflow in our parsers, for example, and then exploit that to their advantage somehow. It cannot happen, because the parsing is done in isolation, but let's say that if we hadn't put a 500 kilobyte limit on the robots.txt file, then people would be able to try a 4 gigabyte file to see if that would cause damage to the parser, namely a buffer overflow, or try with 64 gigs or whatever. And then once you have that buffer overflow, you have access to memory blocks that you could exploit to your advantage. And these are things that we pay attention to when we are developing the standards. So basically, when I'm reading a draft, I would look at how I would exploit the stuff that the standard is describing, and then make recommendations to the authors of the draft, saying: hey, I think this particular bit could be exploited. How about you add a 500 kilobyte parsing limit? Because then basically the plan of the attackers, or potential attackers, is foiled from the beginning. With 500 kilobytes you're not going to exhaust memory. Mhm. Or if you do, then you have a different problem. To me, it often feels like people are nitpicking on stuff, but they are nitpicking for a very good reason, and that is that these standards have to work everywhere, for a long time, without fault, or with as little fault as possible. Yeah, they are going to nitpick every single little thing in the draft and make recommendations about how to improve it.
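The parsing limit described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any production crawler does it: the names `MAX_ROBOTS_BYTES`, `truncate_for_parsing`, and `read_capped` are hypothetical, and the cap is taken as 500 × 1024 bytes.

```python
import io

# Assumed cap for illustration: 500 * 1024 bytes.
MAX_ROBOTS_BYTES = 500 * 1024


def truncate_for_parsing(body, limit=MAX_ROBOTS_BYTES):
    """Return at most `limit` bytes of a robots.txt body.

    If the body is longer, cut back to the last complete line so that
    a rule is never parsed half-way.
    """
    if len(body) <= limit:
        return body
    head = body[:limit]
    cut = head.rfind(b"\n")
    return head[: cut + 1] if cut != -1 else head


def read_capped(stream, limit=MAX_ROBOTS_BYTES):
    """Read from a binary stream without ever buffering the whole file."""
    # Reading limit + 1 bytes is enough to tell whether the input was too long.
    data = stream.read(limit + 1)
    return truncate_for_parsing(data, limit)


if __name__ == "__main__":
    oversized = io.BytesIO(b"Disallow: /x\n" * 100_000)  # about 1.3 MB
    kept = read_capped(oversized)
    print(len(kept) <= MAX_ROBOTS_BYTES, kept.endswith(b"\n"))
```

The point of the design is that the reader decides how much memory it will spend before it looks at the content, so a 4 gigabyte file costs no more than a 500 kilobyte one.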
And that can also go really weird, because sometimes it's nitpicking about the language that's used in the draft. Oh, okay. So, for example, you haven't explained clearly enough how parsing should be done when there's an empty line between two rules in robots.txt. And then you would go back and refine your draft and add more words to a sentence to better explain that kind of stuff. I think in our case we were very lucky because we had our tech writer with us, and Lizzie is really good at noticing deficiencies in the language that we are using. When you or I write something, when we spit out our drafts, on first read it's like, the English is very weird. Mhm. Sometimes, because English is not our first language. Yeah. And then Lizzie would come in and she would clean it up and ask questions like, is this what you meant, or is this what you meant? Because depending on where you put, I don't know, the comma, it might mean different things. Mhm. So yeah, that's one thing. And the other thing is that, especially in IETF standards, we use certain keywords that have weight, and that would be stuff like shall, shall not, may, or must. If you're reading RFCs, those are capitalized, and that's because those are special-meaning keywords. Well, not special meaning, but they have special weight in the documentation. Well, not documentation, in the standard. And when you're reading it as an implementer, you have to understand that if something says must, then you actually have to do it. So let's say that in the case of robots.txt you have rows, and the rows contain a key-value pair, right? And then the draft would say that the key must be separated from the value using a colon. And then when you are writing your parser, you know that there's no wiggle room there. It has to be a colon that is the separator.
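To make the "must be a colon" point concrete, here is a small Python sketch of splitting one robots.txt row into its key and value. It is a simplification for illustration, not the full grammar of the robots.txt standard; the function name `parse_rule_line` and the choice to lowercase the key are assumptions.

```python
from typing import Optional, Tuple


def parse_rule_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one robots.txt row into (key, value), or return None.

    Hypothetical helper: strips '#' comments, requires a colon separator,
    and lowercases the key so 'Disallow' and 'disallow' compare equal.
    """
    line = line.split("#", 1)[0].strip()
    if not line:
        return None  # blank line or comment only
    # Split on the first colon only, so URLs in the value stay intact.
    key, sep, value = line.partition(":")
    if not sep or not key.strip():
        return None  # no colon, or nothing before it: not a valid row
    return key.strip().lower(), value.strip()


if __name__ == "__main__":
    for row in ["Disallow: /private", "Sitemap: https://example.com/sitemap.xml",
                "Disallow /oops", "# just a comment"]:
        print(repr(row), "->", parse_rule_line(row))
```

A row without a colon is rejected rather than guessed at, which is the practical meaning of "no wiggle room" for an implementer.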
But then, for example, with a parsing limit we do want to allow some wiggle room. For example, if you are absolutely certain that you cannot get a buffer overflow from very large robots.txt files, then you could say that parsers should parse the first 500 kilobytes of robots.txt, and then you know that, okay, it's not a hard limit. It's basically a lower limit of how much I have to parse, and then if I want to parse 700 gigs worth of robots.txt file, I can do it. The standard doesn't set a hard limit on that. So yeah, those keywords mean a lot. Okay. I don't actually know where I was going with this. It takes a long time for a standard to get made because they have to be long-lived, and eventually all these comments are addressed and everyone in the working group is happy with it. So, does it automatically become a standard? Okay. So you can push back on comments, like: hey, I took your comment into consideration. Here are my reasons for using the current language or the current structure or whatever, and I think your comment should not be applied to my draft. And then you can go back and forth and convince the other person to basically accept whatever you already have. Or you address the comment. Yeah. And then basically implement the change that you were asked to implement. And then once there are no more comments from the working group, there's something called a last call. Mhm. And I actually monitor that list most rigorously, I guess. Basically, the working group, or the shepherd of the document, believes that all comments were addressed and there haven't been new comments for a while. So if you have something to say, say it now or be silent forever, because we are going ahead with standardizing this.
Another interesting thing is that there's a bunch of different directorates in the IETF that will need to review a document or a draft during the last call. So for example, the structure of the RFC, or the draft, actually matters, because the drafts may be parsed by machines, and then it needs to follow some very specific structure, or even just the publishing engine that they are using will need to be able to parse it. References, for example: it matters where you put them, and it matters what kind of reference you are using. There's informative, where you are just informing. For example, from robots.txt we reference sitemap XML as an informative reference. But then there are also normative references. So for example, when we are talking about HTTP headers, we call in a normative reference to the HTTP standard, like RFC 9110, because we make claims about how something should be parsed, for example, in the HTTP header, or how something should be interpreted in the HTTP header. A normative reference is basically a link to the doc that defines some behavior, let's say. And then someone extremely familiar with the topic that you're discussing in your draft is going to be asked to review your draft. Mhm. To see that no one actually missed anything that can be missed. So basically you get a final review, and then once every directorate is happy with your draft and there are no more last call comments, it can be moved forward with standardization. And then there are at least two kinds of standards that I can think of, at the IETF at least. One is proposed standard, which basically is a standard but it's not immutable. Ah. Meaning that technically it could be changed, or that's how I interpret it.
And then there are actual standards, internet standards, like STD 1, standard one, which defines an immutable standard that the internet uses. Okay. You can add extensions to TCP, but technically it's immutable. The way it is right now, that's how it has to stay. I think that's reasonable. Then you can at least rely on that, and if you can still extend it without changing the underlying fundamental standard, I think that's fair, right? Yeah, there might be limitations, but hey. But that's it. Okay. So then everyone agrees, or you have explained why something got excluded, and then it becomes a standard by basically being reviewed a final time by the directorates, and then there's a last call period, and then we have a standard. Ta-da. Okay, that's pretty cool. You said that takes years. Yeah. Is that because the consensus takes so long, or do you need to have a reference implementation or something? How come it takes so long? I think it's both. You have to show that the thing that you are working on actually works. And for that, usually, like in the TLS working group, you would have adoption calls for new drafts if they don't have someone to work with already. Let's say that Martin came up with this new brilliant idea and needs someone to implement it as a test to show that it actually works, like have a proof of concept, I guess. You need to show that it works. And then the other thing is that, especially with certain drafts, there's lots of back and forth on the mailing list about particular sections of something, or even the general topic that you are discussing in your draft. Argument is not the right word. Basically it's just civil and constructive discussion about the draft, or sections of the draft.
And, you know, the internet: people have opinions about stuff, and then you have to decide whether you address the comments and are then able to move forward with your draft, or take a step back and maybe revise that whole paragraph, or the whole draft, to exclude the part that people are upset about, okay, or nitpicking on. So basically there's tons of iteration going on, and it makes the process very slow. Mhm. But for a good reason. As we said, these standards are sometimes used by the whole internet. In the case of TCP, for example, the whole internet is using it, so it has to be ironclad. Mhm. There's no wiggle room there. Yeah. I mean, there's the saying: if you want to go fast, go alone; if you want to go far, go as a group. And I think this is one of the examples where going slower improves the quality and longevity of the outcome. And all of this is public, right? It's not happening behind closed doors? Okay. No, everything is public, and our meetings are public too, so technically anyone can join in and listen to what we are talking about, or even just say words in the meeting. There's no formal membership. You can just show up and contribute to standards. That's really cool. I don't know how it works with other standards bodies, but at least with the IETF, you can just show up and say what you have to say in our meetings. For formal meetings there's usually an entrance fee, but otherwise you can just show up. I mean, the fee is probably to cover the cost of the location and all the logistics of making it happen. Yeah, that's fair. The last IETF meeting I went to, which was in Bangkok, was a week long including the hackathon, and we were using three floors of the hotel. Mm. Literally all the meeting rooms that the hotel had available. So it must cost an enormous amount of money. Yeah.
So they have to recuperate that somehow, because they don't make a profit. And as far as I know, that's pretty much the case for most of the standards bodies, because I know that for the W3C it's very easy to set up an account. It's free. You can start your own working group and do your thing, and then eventually you come out with something that looks like a proposal, and then other people are invited to comment on it, and then that whole process happens. But it's all public as well. I'm pretty sure TC39, which governs JavaScript, or ECMAScript, is doing more or less the same thing, and I'm pretty sure that the WHATWG does so as well, and I think they are even on GitHub, if I'm not mistaken. So all pretty transparent processes, which is pretty cool, I think. Yeah. No, that was interesting. So that's how a standard is made. That's how the sausage is made, from the inside. Wow. So, okay, if we had a bunch of years and enough motivation, we could make, for instance, sitemaps a standard. That's interesting. I mean, we could. But probably you have to sit down and figure out whether it's worth it, because it's a simple XML file and there's not that much that can go wrong with it. I was thinking about submitting a proposal for standardizing it, but then I was thinking, but why? What's the benefit? Because with robots.txt there was a benefit: we knew that different parsers tend to parse robots.txt files differently, and if you have a standard, then at least you fix that, potentially. With sitemap it's like, eh. Yeah. If it's not a standard, then what? Okay. So you have to weigh the benefits. And as you said, one of the benefits is that you can make things more reliable across different products from different vendors, I guess. Okay. I mean, compared with those de facto or informal standards. Yes. So, what are the benefits?
So, why would you do it? What did you get out of it with robots.txt? It's that we know for certain that we are now in a better place when it comes to parsing robots.txt files than we were 10 years ago. It also allowed us to open source our robots.txt parser, and then people started building on it, which also helps with creating better robots.txt files, I would imagine. And robots.txt, at least to me, but I think also for pretty much every search engine, is a super important thing, and if we can agree on how robots.txt files should be parsed, then there's less strain on site owners. Oh, fair. Like trying to figure out how to write the damned files. So it works for everyone. I see. Like every consumer of robots.txt files. And to me that's nice for the community and nice for the internet itself. Okay, that makes sense. That was really cool. Thank you so much for taking me on this journey of how the web standards, internet standards and all that are made. I've never been part of that kind of work in the IETF, so that's interesting. And whose fault is that? It's mine, I guess. It's mine entirely. And I think that's it for this episode as well. If people want to find out more about this, then check out the IETF, check out the W3C and all the other standards bodies. They have pretty good websites that explain how these processes work and how you can contribute. Maybe check out the dispatch list from the IETF. There might be interesting things coming that you are looking to be part of. I don't know. Yeah. Anyway, thank you all, folks, for listening, and goodbye. Bye-bye. We've been having fun with these podcast episodes and we hope that you, the listener, have found them both entertaining and insightful, too. Feel free to drop us a note on LinkedIn or chat with us at one of the next events that we go to if you have any thoughts.
And of course, don't forget to like and subscribe. Thank you and goodbye. [Music]