How Googlebot Crawls the Web
2025-05-29 · en automatic
[Music] Hello and welcome to a new episode of Search Off the Record, the podcast coming to you from the Google Search team, where we talk all about search and maybe have some fun along the way. My name is Martin and I'm a... job title. Oh boy, we should update these notes. My name is Martin and I'm a search relations engineer on the Search Relations team at Google. But I'm not alone today. With me is Gary. Gary? Yes. Are you here? I'm here. Are you there? You hear me? Okay, good.

Gary, I have a question. Okay. I noticed that we recently updated the crawler list, and someone reached out to me and said they were crawled by "Google". I was like, I don't think that's a user agent we use, but apparently it was one we used. I'm not sure how that happened, but I think that was a whoopsie. Maybe we should actually explain how Googlebot works and what that part of the pipeline is. What do you think? Should we talk about Googlebot?

We should talk about Googlebot. I mean, technically that was correct: they were crawled by Google. Fair. Coming from a Google IP address. Yeah, technically correct. And we all know that the best kind of correct is technically correct. That's true.

But let's talk about it. Let's start with something very obscure; let's just call it a crawler first. It's been around for as long as Google itself. Well, actually it probably predates it, because to start a search engine you need a crawler, right? For a bunch of things you need a crawler. Yeah, like when you go out pubbing or something. Ah. Oh, okay. No. If you want to use data that is on the network, you need something that requests it, right? Isn't that what a crawler fundamentally is, or part of what a crawler is? A crawler probably does more than that, right? I mean, yeah, but technically crawlers are just HTTP clients, right?
Much like your browser, which is also an HTTP client, just a more advanced one, I guess, because it can do more things than just fetching data over the network. But technically crawlers are just really dumb browsers, maybe. I mean, there's this library and command-line utility called curl, right? cURL. That's a crawler, kind of, I guess. You can use it as a crawler. Worst case, you would write a shell script, for example, to loop through a set of URLs that you pass on to curl, and then it fetches the stuff for you and you save it to disk. Technically that's what a minimal crawler is as well. I don't know if it is in curl, but in wget you definitely have an option to recursively crawl or fetch something. Basically, it will attempt to extract URLs from whatever you fetched from a particular URL, and then it will attempt to access and download those URLs. You can set the depth limit, meaning how deep you want to go from the initial URL, and I think also whether you want to stay on the domain or not, or something like that. But technically that's already a crawler, right? Because you go out on the internet, you find URLs starting from one URL, and then eventually you fetch the URLs that you just found as well.

I think there was something Sergey Brin, our co-founder, said. Well, I know that he's a co-founder, so not "I think". Anyway, I think he said, and this was very, very early on, mid '90s or end of the '90s or something like that (1990s, for the very young listeners), that if you take a very popular page like the homepage of CNN or the Wall Street Journal, Fox News, or whatever, you can just follow the links, where "follow" means that once you've found a URL on a page, you just fetch it.
You fetch that URL, and so on, basically recursively fetching the URLs that you find on the internet. And if you start from a very popular page, you can actually crawl the whole internet just from that one starting point. Now, obviously that doesn't hold true anymore, but it was a much simpler internet and it was much easier to fetch back then.

But I guess if I were to write a shell script that loops over a list of URLs, and maybe even extracts URLs from those pages and keeps going, there's probably more to it than just that, because the internet has grown quite a lot and I can imagine that approach won't work today.

I mean, it depends on what you want to do, right? If you just want to crawl a set of pages from a site, for example because you want to mirror your site locally, then technically you could do that. I think there are other problems that you need to take into consideration, especially nowadays. There's so much automated traffic on the internet that if you want to be on your best behavior, you want to at least support robots.txt, the Robots Exclusion Protocol, and have some system (I wanted to say algorithm, but it's not a singular thing) that monitors the health of the host and backs off, slows down, if the host is becoming unhealthy.

Ah, so it adjusts the crawl rate, basically. Yeah. Because you don't necessarily want to be an ass; you want to kind of behave, right? Okay. Otherwise you are just DDoSing the server. Yeah, I guess your neighborhood is not going to like it if you bring down all the websites.

So when Larry started doing his BackRub system, that must have eaten a lot of bandwidth. I guess it's relative. Also, why would anyone name anything BackRub? I always had a beef with that, because it's such a weird name to give to even an academic search engine or academic exploration or whatever.
Calling it BackRub is just creepy. Why? It's a bit odd. Yeah. I mean, it's a backlink, and you get something for backlinking, and then it's like rubbing someone else's back. I like you, Gary, but please don't rub my back. Thanks. You sure? Yeah, pretty sure. Oh, sorry. Okay. This is sad.

No, but to your point that it must have consumed lots of bandwidth: back in the day, one thing was that pages were way more lightweight. True. Way, way more lightweight. I remember one of my first sites, this was the late '90s. A few pages. Well, it was a site, I guess. And the HTML that I put together was like 7,000 bytes. Like 7 KB. That's nothing today. That's like an image, basically. Actually, images are probably larger these days. Yeah, much larger. It's so tiny that even if you crawl hundreds of thousands of them, it's not going to make a dent in your budget. But on the flip side, bandwidth was much more expensive than it is nowadays, so they must have had some sort of system to monitor that they weren't exhausting their very expensive bandwidth. But then you also have to take into consideration the site's bandwidth. Oh god, yes, true. They pay for it as well.

So yeah, I think it was much easier to crawl the internet back in those days when BackRub was coming online, but it was also trickier for a different reason, and that was probably cost. But yeah, we had BackRub. I don't know how fetching was done for BackRub. I would imagine that they just had some shell script or something that fetched all those pages for them to create the initial index. Again, I'm just making this up; I actually have no idea. But it was likely not that complicated, because the web was so, so much smaller.
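The minimal crawler described in this part of the conversation (start from one URL, extract the links, fetch those too, stop at a depth limit, optionally stay on one host) can be sketched roughly as follows. This is only an illustration in Python, not how BackRub, wget, or Googlebot work. The `crawl` function, the `LinkExtractor` class, and the in-memory `FAKE_WEB` pages are all made up for the example so that it runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=2, same_host_only=True):
    """Breadth-first crawl from start_url; returns {url: html}.

    fetch is any callable that takes a URL and returns HTML text or None.
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth >= max_depth:
            continue  # depth limit reached: keep the page, do not follow links
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if same_host_only and urlparse(link).netloc != start_host:
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# Hypothetical in-memory "web" so the sketch runs offline.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.example/x">X</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/c">C</a>',
    "https://example.com/c": "leaf",
    "https://other.example/x": "off-site",
}

pages = crawl("https://example.com/", FAKE_WEB.get, max_depth=2)
print(sorted(pages))
```

With a depth limit of 2 and the same-host restriction, the sketch fetches the start page, `/a`, and `/b`, and never touches `/c` or the off-site page. Note that it has no robots.txt handling and no rate limiting, which is exactly what the conversation says a well-behaved crawler needs on top of this.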
But I mean, yeah, just before this recording I was reading "The Anatomy of a Large-Scale Hypertextual Web Search Engine", the paper that Sergey and Larry published at Stanford, and they are talking about 110,000 web pages and web-accessible documents for one of the early search engines called... Oh, that's cute. Yeah. World Wide Web Worm, or WWWW. This was '94. And then there was the other one, from the guy that came up with the idea of robots.txt, the Robots Exclusion Protocol. I wanted to say his name too fast and I forgot. And he also had a search engine called WebCrawler, which claimed to have indexed about 2 million pages. Oh wow. And at today's scale, 2 million is still like, that's cute, right? I think the boundary for someone to worry about crawl budget is what, 10 million? 1 million? Something like that. For a single site, I would say 1 million is okay, probably, and that's pretty much like half the... Of course, it also... yeah. But I mean, when we talk about crawl budget, or how much load we put on the server, you also have to think about how the site is constructed, because if you are making expensive operations to construct the page, then of course that's going to put a much heavier load on the site than a simple HTML site would, right? True. For example, if you are making expensive database calls, that's going to cost the server a lot. But back then it was much simpler anyway. Wild.

Okay, so bandwidth is a thing. We've talked about that. Right now Google does a lot of things that probably need to, or want to, ingest data from the web. Does that mean that we have lots of shell scripts, or how do we handle that these days? Because I think bandwidth needs to be taken care of across products, no?

Yeah. But I mean, back then we only had BackRub, or Larry and Sergey had BackRub, right? And then they launched Google in '96. Yeah.
'96. Then they launched Googlebot, basically the crawler that they were using for the search engine. I think they might have named it Googlebot in '99. Before that it was just, like, nothing, although I know that from the very beginning of Google, robots.txt was supported. So whatever they were using, it already allowed site owners to opt out of crawling.

But then we started having multiple, or new, products, right? We had AdWords coming out in the early 2000s, and then AdSense, 2003-ish I think, and then you also have some kind of fetching in Gmail, which is a 2005, 2006 thing. For example, fetching the images, because you don't want to allow the browser to fetch the images remotely, because then you are giving away users' metadata to remote sites. So you want to somehow proxy those image fetches in emails.

Anyway, more and more products had to do some fetching, and for a time I think everything was done with Googlebot, which was just this service that you plugged a URL into and it fetched it for you. It was always just Googlebot, and you could give it a million URLs or just five and it would fetch them for you within the limits of the host load that individual sites could take. That's not a very nice design when you are designing for multiple products, because then people can't really tell what the fetch was for, right? Okay. Yeah. Because it just looks like Googlebot came and fetched something, and you're like, "I'm going to be on Google!" Early 2000s Gary, very excited about being on Google. But then it was actually a fetch initiated by, let's say, Gmail or AdWords or something. So I never ended up in the Google index, and not because I was a spammer. Definitely not because of that. Sure, sure. Spammy guy. So we introduced new crawlers.
But that would also mean that with all the software engineers and computer scientists that we have, every now and then someone would come up with the brilliant idea of "oh, I will just write my own crawler, because I need my own user agent", which again is not great, because from a maintenance perspective it's an absolute nightmare. And then the different crawlers that people wrote might have different policies about robots.txt and host load and bandwidth usage and whatnot. So eventually someone had to come up with the idea that, okay, we will just have this one unified system, and you can fetch from the internet with it, but you have to specify your own user agent string when you are fetching. And then I think in 2006-ish Google AdWords came out with AdsBot. From then on we started having more and more crawlers that are not Googlebot.

And all of them behaved the same way? I mean, yes. And that was the nice thing about the shared infrastructure, right? Because then you could have a common way to behave on the internet for every crawler that you send out.

All right. That makes sense. That makes a lot of sense, because basically you are now bundling all the traffic, so to speak, that goes out to websites in terms of crawling through the lens of one piece of code. The one thing that I could see as unfortunate: I see that all of them behave the same way because they are all kind of robotic agents that go out and do something for an automated system. But what if I need to write a piece of code that does more or less the same, but is user-initiated? If a user clicks on something, say I submit something for a review or for a specific product where I specifically say "hey, please do this", then I'm not sure following robots.txt makes sense, for instance.
It might make sense in some cases, but it might not, because it's not really a robot if I ask it to do something.

I mean, that's a very philosophical question, whether it's a robot or not. And nowadays, with all the AI agents and whatnot, there's more and more discussion about this. But yeah, you're right. When a user is sitting behind the keyboard and wants to complete a specific action, let's say they want to load something into a spreadsheet from a specific URL, then you are doing a fetch on behalf of a user. So I think you're right that ignoring robots.txt in those cases is the right thing to do, unless the team that is providing that feature actually wants to follow robots.txt. Basically, you might want to opt in.

The other thing would be latency, because with crawlers you have a massive URL database from which they take the URLs that they need to fetch. And then you have to sort that somehow. When a user wants to fetch something and you add that URL to that database, it ends up at the bottom of the list, and then you have to wait for the earlier-added URLs to be consumed until you reach the URL that the user just added. That might sometimes take weeks; sometimes you just can't fetch fast enough, or you have other limitations. With user-triggered fetchers, what you can do is more or less ignore the signals that the sites give and basically just try to make the fetch immediately. For example, in Search Console you can see this when you do the live test or site verification. Well, actually not the live test. The live test is actually a crawler. Oh, yeah. Because it needs to... yeah, it's high priority, but it's still a crawler. Fair, fair. But site verification, yeah, that makes sense. That's a user-triggered thing.
Yeah. And you don't have to wait for it for hours or weeks. It happens almost instantane... I will not say that word. Instantaneously. Yeah, that. Okay. But yeah, I think you need both of them, because they are different use cases. That makes sense.

But in terms of different use cases, it doesn't sound like this is a use case specific to Google. So I guess other people have crawlers as well, then. Yeah. We were not the first ones to do this, right? Yeah, exactly. The World Wide Web Worm operated their crawlers before Google was even conceptualized, even before Larry had the idea that, hey, PageRank, we could use this to do something. And since then we have other search engines and, I guess, a lot of crawlers these days.

Do you see a change in the way that crawlers work or behave over the years? Behave, yes. How they crawl, there's probably not that much to change. But, well, I guess back in the day we had what, HTTP/1.1? Probably they were not crawling on 0.9, because no headers and stuff like that; that's probably hard. But anyway, nowadays you have H2 and H3. I mean, we don't support H3 at the moment, but eventually, why wouldn't we? And that enables crawling much more efficiently, because you can stream stuff. Streaming meaning that you open one connection and then you do multiple things on that one connection instead of opening a bunch of connections. So the way the HTTP clients work under the hood changes, but technically crawling doesn't actually change. Okay. And then how different companies police, or set policies for, their crawlers, that of course differs greatly.
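The latency difference discussed above, between URLs that wait their turn in a crawl queue and user-triggered fetches that run immediately, can be illustrated with a toy scheduler. This is a hedged sketch in Python and not a description of Google's actual scheduling. The `FetchScheduler` class, its priority values, and the example URLs are invented for illustration.

```python
import heapq
import itertools


class FetchScheduler:
    """Toy scheduler: a lower priority number is fetched first."""

    CRAWL = 10  # normal queued crawling
    HIGH = 1    # high-priority crawl, still goes through the queue

    def __init__(self, fetch):
        self._fetch = fetch                # callable(url) -> response
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order within a priority

    def enqueue(self, url, priority=CRAWL):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def run_next(self):
        """Fetch the next queued URL, or return None if the queue is empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url, self._fetch(url)

    def fetch_now(self, url):
        """User-triggered fetch: bypasses the queue entirely."""
        return url, self._fetch(url)


log = []
scheduler = FetchScheduler(lambda url: log.append(url) or "body")
for i in range(3):
    scheduler.enqueue(f"https://example.com/page{i}")
scheduler.enqueue("https://example.com/live-test", priority=FetchScheduler.HIGH)

scheduler.fetch_now("https://example.com/verify")  # runs immediately
scheduler.run_next()  # the high-priority crawl jumps ahead of page0..page2
scheduler.run_next()  # then the oldest normal-priority URL
print(log)
```

In this sketch the user-triggered fetch never enters the queue, while the high-priority crawl is still a queued job that simply sorts ahead of the others, which mirrors the distinction drawn in the conversation between site verification and the live test.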
And if you are involved in discussions at the IETF, for example, the Internet Engineering Task Force, about crawler behavior, then you can see that some publishers are complaining that crawler X or crawler Y was doing something that they would have considered not nice. So the policies might differ between crawler operators, but in general I think the well-behaved crawlers would all try to honor robots.txt, or the Robots Exclusion Protocol in general, and pay some attention to the signals that sites give about their own load, or their servers' load, and back off when they can.

And then you also have the, what are they called, the adversarial crawlers, like malware scanners and privacy scanners and whatnot. You would probably need a different kind of policy for them, because they are doing something that they want to hide. Not for a malicious reason, but because malware distributors would probably try to hide their malware if they knew that a malware scanner was coming in. I was trying to come up with another example, but I can't. Anyway. What else do you have? Okay. Well, I think... Oh, and then you have the bad actors, right? They are just like, "I just want to crawl half of the internet in 25 seconds." Yeah. They might overpower your server, and that is not a very nice thing to happen.

Okay, so we have the need to ingest data from the web, and then you build infrastructure to do that, because it's not a trivial thing, and at Google we have shared infrastructure for that. That's pretty cool, and we try to be a nice citizen of the web. So hopefully other crawlers will continue to do that rather than try to ingest the whole internet in 25 seconds. That sounds fun, but I don't think that's feasible in the long run.
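The idea of one shared fetching path that every product uses, where each caller must supply its own user agent string and the same robots.txt and host-load policy applies to all of them, might look something like the sketch below. It is an assumption-laden illustration in Python, not Google's infrastructure. `SharedFetcher`, `HostState`, the doubling and halving of the delay, and the choice of 429, 500, and 503 as overload signals are all made up for the example. The robots.txt rules are passed in as lines rather than downloaded so the code runs offline.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class HostState:
    """Per-host robots.txt rules plus an adaptive delay between fetches."""

    def __init__(self, robots_lines, base_delay=1.0, max_delay=60.0):
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay

    def record(self, status):
        """Back off on overload signals, recover gradually on success."""
        if status in (429, 500, 503):  # assumed overload signals
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            self.delay = max(self.delay / 2, self.base_delay)


class SharedFetcher:
    """One fetch path for every product; callers must identify themselves."""

    def __init__(self, transport, sleep=time.sleep):
        self._transport = transport  # callable(url, user_agent) -> status code
        self._sleep = sleep
        self._hosts = {}

    def register_host(self, host, robots_lines):
        self._hosts[host] = HostState(robots_lines)

    def fetch(self, url, user_agent):
        if not user_agent:
            raise ValueError("a user agent string is required")
        host = self._hosts[urlparse(url).netloc]
        if not host.robots.can_fetch(user_agent, url):
            return None  # disallowed by robots.txt, no request is made
        self._sleep(host.delay)  # same politeness policy for every caller
        status = self._transport(url, user_agent)
        host.record(status)
        return status


# Hypothetical host, rules, and responses so the sketch runs offline.
robots = ["User-agent: *", "Disallow: /private/"]
statuses = iter([200, 503, 503, 200])
fetcher = SharedFetcher(lambda url, ua: next(statuses), sleep=lambda s: None)
fetcher.register_host("example.com", robots)

blocked = fetcher.fetch("https://example.com/private/x", "ExampleBot")
for _ in range(4):
    fetcher.fetch("https://example.com/page", "ExampleBot")
print(blocked, fetcher._hosts["example.com"].delay)
```

Here the disallowed URL is never requested, and the two 503 responses push the per-host delay from 1 second up to 4 seconds before a successful response brings it back down to 2. A real system would also need to fetch and cache robots.txt itself and would likely use richer health signals such as response latency.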
Also, for people operating websites, you might just have random traffic spikes, and these traffic spikes might still cost you some money. Yeah, I mean, that's one thing that we've been doing last year, right? We were trying to reduce our footprint on the internet. And of course it's not helping that new products are launching, like new AI products that do fetching for various reasons, and then basically you saved seven bytes from each request that you make and this new product will add back eight. But the internet can handle the load from crawlers. I firmly believe, and this will be controversial and I will get yelled at on the internet for this, that it's not crawling that is eating up the resources. It's indexing and potentially serving, or what you are doing with the data when you are processing the data that you fetched. That's what's expensive and resource intensive. So yeah, I will stop there, before I get in more trouble.

Okay, before I put you in more trouble, thanks a lot, Gary, for explaining crawlers to me. That's the past and present for crawlers, but what's the future going to look like? Are we working on something? HTTP/3 is something that we will eventually get around to, I guess. But what else? Yeah, I mean, H3 is not going to solve the bigger problems, I don't think. Like what? We just get the trailers, but you get the trailers with H2 as well. So it's not going to fix our bigger problems. Well, what do you think are the bigger problems, first, before we talk about solutions? The web is getting congested, and it's because everyone and my grandmother is launching a crawler or fetchers or whatever. We will have more automated traffic from AI agents, for example, and other AI shenanigans.
So basically the web is going to be more congested, but it's not something that the web cannot handle. The web is designed to be able to handle all that traffic, even if it's automated, and I would say that it's in good hands. If they see that there are some problems with load and whatever, then they will just come up with some new technologies that will fix or reduce that issue.

What else? I really like what Common Crawl is doing, because they release datasets. Basically they have their crawler, they crawl some parts of the internet, and then they release that as a dataset. So you don't have to crawl yourself, and I think that's very nice, because then you basically have the same thing that we have internally: a single infrastructure doing the fetching, respecting robots.txt and host load and whatnot, and then you can just consume the data. Of course, internally you can't just consume the data. That's different. You still have to do fetches, but at least the Robots Exclusion Protocol policies and the host load limits are enforced for the crawl job that you set up. I don't know if we need more of these, but I thought it's a good idea and a nice idea.

Okay. All right. Well, Common Crawl, then. That's something that I don't think I've looked into. I should probably have a look at that. Well, in that case, thanks a lot, Gary, for giving me a journey through the world of crawling. And I do hope that you all out there enjoyed this episode and had a good time. If so, let us know in the comments. Like and subscribe to hear more of our episodes. And also tell us if you want to have a specific episode on a specific topic. So with that, again, thanks, Gary, and enjoy your time listening to this out there, and bye-bye, listeners. Goodbye. Oh god, why? Bye-bye, Gary.
We've been having fun with these podcast episodes, and we hope that you, the listener, have found them both entertaining and insightful, too. Feel free to drop us a note on LinkedIn or chat with us at one of the next events that we go to if you have any thoughts. And of course, don't forget to like and subscribe. Thank you and goodbye. [Music]