Google crawlers behind the scenes
2026-03-12 · en-j3PyPqV-e1s manual
[MUSIC PLAYING] MARTIN SPLITT: Hello, and welcome to the latest episode of Search Off the Record. In this show, we from the Search Relations team here at Google are trying to give you a glimpse of what's happening behind the scenes. And with me today is Gary. Hello, Gary. GARY ILLYES: Hello. MARTIN SPLITT: How are you doing? GARY ILLYES: I'm great. MARTIN SPLITT: Fantastic. Let me change that. GARY ILLYES: OK. MARTIN SPLITT: I want to talk about crawling. GARY ILLYES: Oh no, no, no, no, no, no, no, no, no, no, no. MARTIN SPLITT: Actually, no. I want to talk about crawlers because I am wondering if we ever discussed what exactly our crawling infrastructure looks like. Because people keep talking about Googlebot as if it's like a sort of almost a living thing, or at least a specific program. But there's no Googlebot EXE that you double click on, and then it launches or something. GARY ILLYES: There's not? MARTIN SPLITT: It works a bit differently. GARY ILLYES: How? What? MARTIN SPLITT: You taught me that. Yeah. GARY ILLYES: Well, you're correct. MARTIN SPLITT: Do we want to elaborate a little bit on that? So how can I imagine Googlebot? What does our crawling infrastructure roughly look like? GARY ILLYES: Calling it Googlebot, that's a misnomer. And it's something that back in the day, perhaps early 2000s, it worked well because back then, we probably had one crawler because we had one product. But then soon after, another product came out. I think that was AdWords. And then we started having more crawlers, and then more products came out, and then more crawlers, and then more crawlers. But the Googlebot name, that somehow stuck. And generally, when we were talking about our crawling infrastructure in general, then we tended to call it Googlebot, but that was wildly inaccurate, because Googlebot was just one thing that was communicating with our crawler infrastructure. I don't know if that makes sense. MARTIN SPLITT: How can I imagine that? 
What do you mean by communicating with our crawler infrastructure? Googlebot is our crawler infrastructure. No? GARY ILLYES: Well, yeah, that's what I've been saying for the past three minutes. MARTIN SPLITT: Yeah, but I can't picture it still. GARY ILLYES: So Googlebot is not our crawler infrastructure. Our crawler infrastructure doesn't have an external name. It has an internal name. It doesn't matter what it is. Let's call it Jack. And I don't know how to put it. It's software as a service, if you like. MARTIN SPLITT: Oh, OK. What's that? SaaS. Right? GARY ILLYES: Yeah. And then so Jack has API endpoints, so to say. And then you can call those API endpoints to do a fetch from the internet. MARTIN SPLITT: Right. GARY ILLYES: And then when you do those API calls, then you also need to specify some parameters like, how long are you willing to wait for the bytes to come back? Or what is your user agent that you want to send? What is the robots.txt product token that you want to obey? And all these parameters. And we do set a default parameter for most of these things, not all of them, but most of these things. So you can generally omit them, which makes these calls simpler, I guess, because you don't have to specify all the stuff. But otherwise, it's really just an API call to something in the Cloud or in some random data center. And then that will perform a fetch for you as a software developer or a product or whatever. MARTIN SPLITT: That's really nice. So I guess there's also a team that manages it, because effectively, what I'm doing is I'm outsourcing it to someone else. GARY ILLYES: Yeah. MARTIN SPLITT: To make all these decisions for me. GARY ILLYES: So this product-- because we can call it a product at this point, even if it's internal-- this has been around for a very, very, very, very long time. So technically, it's been around since Google existed. 
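To make the "API call with mostly defaulted parameters" idea concrete, here is a minimal Python sketch. It is not Google's internal API: the class name, field names, and every default value (timeout, user agent, product token) are hypothetical, chosen only to mirror the parameters Gary lists. The 15 megabyte figure is the default limit he mentions later in the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchRequest:
    """Hypothetical request sent to a central fetch service."""

    url: str
    # How long the caller is willing to wait for bytes (assumed value).
    timeout_seconds: float = 30.0
    # User agent string to send (placeholder, not a real Google crawler).
    user_agent: str = "ExampleBot/1.0"
    # robots.txt product token the fetch should obey (placeholder).
    robots_product_token: str = "ExampleBot"
    # Byte limit; 15 MB is the default mentioned later in the episode.
    max_bytes: int = 15 * 1024 * 1024


# A caller that is happy with the defaults only supplies the URL.
simple = FetchRequest(url="https://example.com/")

# A caller with special needs overrides just the parameters it cares about.
custom = FetchRequest(
    url="https://example.com/",
    robots_product_token="ExampleBot-News",
    max_bytes=2 * 1024 * 1024,
)
```

The point of the sketch is only that most callers can omit most parameters, while a team with different needs overrides a single field.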
There were some changes to it because the original version was more or less just a wget that was running on some random engineer's workstation. So if we think back to 1998 or '99, and then as more products came out, the more staffing it needed, for example, more resources it needed. And of course, we needed to re-architect the whole thing to enable teams to call this service. But in essence, it's always been doing the same thing. It's basically, you tell it, fetch something from the internet without breaking the internet, and then it will do that if the restrictions on the site allow it. That's it. If I wanted to put it in one sentence, that would be it. MARTIN SPLITT: OK. So basically, I hand over a bunch of configuration and say, part of that configuration is the bunch of URLs that I want crawled. And then I hand that over to the servers, and then they come back with something to me. Right? GARY ILLYES: Yeah, pretty much. MARTIN SPLITT: And that something probably is the HTTP response, and the headers, and the body, and maybe some additional metadata. Cool. So basically, Googlebot is just a piece of this configuration that I hand over. A name, basically. GARY ILLYES: Say that again. Sorry. MARTIN SPLITT: So Googlebot is not really a program, but a piece of this configuration that I hand over, basically, just a name of the configuration, so to speak. GARY ILLYES: Well, it is one of the callers of the SaaS. MARTIN SPLITT: OK. GARY ILLYES: It's not even part of the configuration. It's just the name that one particular team is using for their fetches that are sent to this central SaaS. MARTIN SPLITT: So basically, one of the clients. GARY ILLYES: One of the clients. Yeah, exactly. Exactly. MARTIN SPLITT: Well, that suggests that there's other clients. GARY ILLYES: Yeah, sure. We try to document a big chunk of them, but Google is a big company. So there's lots of teams that want to fetch from the internet. 
So there's lots of crawlers, lots of named crawlers, which means that we would need to document dozens, if not hundreds of different crawlers, or special crawlers, or fetchers. And on a simple HTML page, that's kind of infeasible. So we try to draw a line and say that if the crawler is really tiny, meaning that it doesn't fetch too much from the internet, then we try not to document it. Because the real estate on the crawler site, on developers.google.com/crawlers is actually quite valuable. We might try to deal with that differently. But for the moment, basically, just the major crawlers, and special crawlers and fetchers are documented, because quite literally, because of lack of space. MARTIN SPLITT: You say fetchers and crawlers. What's the difference? GARY ILLYES: So the simplest way to explain it is that crawlers are doing work in batch, and then fetchers do work on individual URL basis, meaning that you give a URL to a fetcher, and then it will fetch just one URL. You cannot give it a list of URLs to fetch. MARTIN SPLITT: OK. GARY ILLYES: And then for crawlers, it's a constant stream, usually, of URLs. And it's running continuously for your team and fetching for your team from the internet. And internally, we also have this policy that fetchers need to be, in some way, user controlled. So basically, there's someone on the other end who's waiting for the response of the fetcher. MARTIN SPLITT: OK, yeah. GARY ILLYES: While with crawlers, it's like, just do it when you have the time. MARTIN SPLITT: Right, so if there is an automatic system that consumes the response and then does something whenever it's available, then we can, obviously, treat that differently than if someone clicks a button and waits for a result. GARY ILLYES: Right. MARTIN SPLITT: OK, got it. And that's the difference between fetcher versus crawler. GARY ILLYES: I think so, yeah. MARTIN SPLITT: OK, cool. GARY ILLYES: I'm pretty sure that there's more differences. 
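The fetcher versus crawler distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fetch_one`, `crawl`, and the `fetch` callable they receive are hypothetical names, and the real scheduling is certainly more involved.

```python
from typing import Callable, Iterable, Iterator, Tuple


def fetch_one(url: str, fetch: Callable[[str], str]) -> str:
    """Fetcher: exactly one URL, with a caller waiting for the result."""
    return fetch(url)


def crawl(urls: Iterable[str],
          fetch: Callable[[str], str]) -> Iterator[Tuple[str, str]]:
    """Crawler: works through a stream of URLs whenever there is time."""
    for url in urls:
        yield url, fetch(url)
```

A fetcher returns as soon as its single URL is done, while the crawler is a generator that keeps producing results for as long as the URL stream continues.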
For example, the IP ranges that they are fetching from are different. But otherwise, it's pretty much the same infrastructure, more or less. It's just performing different or performing the same tasks differently. MARTIN SPLITT: Right. So I guess if we have documented at least like the major crawlers and maybe even fetchers, then people probably know about them. But you said you only document the major ones. So if I were to start a new project and I needed to somehow have people type in the URL and click a button, then you wouldn't necessarily document that specific project if it's small enough, right? GARY ILLYES: Yeah, exactly. Basically, the trigger for us documenting it, I spent way too much time coming up with, basically, something like SQL queries to trigger alerts for us internally when a crawler or fetcher passes a certain threshold of number of fetches per day. And if that alert triggers internally, then we would get a bug opened, an issue opened internally that would say that, hey, there is a new large crawler in town, and perhaps you want to document it. And then we would go and look at the properties of that crawler, what it's doing, why it's doing it. We would check with the team to ensure that they are not doing something accidentally, because we also had instances where we got a complaint about the crawler doing something on a site, and then we looked at it, and the team was like, no, that crawler is unlaunched. We unlaunched it two years ago. That's not possible. And then we were looking at our logs, and yeah, it was fetching. And then we tracked it down that there was some random job that they forgot to turn down when the project was sunset, that they forgot to turn off that job, and it kept fetching from the internet for no good reason. 
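The alerting idea described above, flagging any crawler or fetcher that passes a fetches-per-day threshold, can be sketched like this. Gary mentions something like SQL queries; this Python version is a stand-in, and the log format, function name, and threshold value are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def crawlers_over_threshold(fetch_log: Iterable[Tuple[str, str]],
                            threshold: int) -> List[str]:
    """Return crawler names that exceeded `threshold` fetches on any day.

    `fetch_log` is a hypothetical log of (day, crawler_name) pairs,
    one entry per fetch.
    """
    per_day = Counter(fetch_log)  # (day, crawler_name) -> number of fetches
    flagged = {name for (_, name), count in per_day.items()
               if count > threshold}
    return sorted(flagged)


log = [
    ("2026-03-01", "crawler-a"),
    ("2026-03-01", "crawler-a"),
    ("2026-03-01", "crawler-a"),
    ("2026-03-01", "crawler-b"),
    ("2026-03-02", "crawler-b"),
]
large = crawlers_over_threshold(log, threshold=2)
```

With this toy log and a threshold of 2, only `crawler-a` is flagged, since `crawler-b` never exceeds one fetch on a single day.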
But nowadays, that's really rare because we have all this monitoring and all these checks in place to ensure that the fetches that we are doing or crawls that we are doing are actually-- or they actually have some utility internally, not just randomly fetching. And on the utility note, there's also really aggressive caching on our side internally. And that's regardless of the HTTP caching mechanisms. So for example, if, let's say, Google News fetched something 10 seconds ago, then does it make any sense to go out with another crawler who's supplying data to web search and fetch that thing again? It probably doesn't. So basically, we just hand it the copy that we got 10 seconds ago to avoid these things. But then there's also tricky things where different projects might have different policies about reuse of content fetched for something else. Let's say that something random, like AdWords cannot reuse content that was fetched for web search. MARTIN SPLITT: That makes sense. And you said something about a job that was still running. So I'm guessing this infrastructure is huge and has to consume a lot of URLs every day. So I'm guessing we're not running this from your computer on your desk. GARY ILLYES: Right. So this is going a lot into our infrastructure. But imagine the same way that Google Cloud has those runner instances, or whatever they are called. We would have something similar internally. So basically, I can bring up a job on some remote server in some random data center in Atlanta and run my job there. And the job would be a C++ program that I compiled into a bin file and run it from there, or run it as a bin file. MARTIN SPLITT: OK. GARY ILLYES: But within that program that I compiled, I would make the API calls. So basically, I can instruct that program to connect to an API endpoint of that SaaS crawler infrastructure thingy, and instruct a crawl, or set up a crawl, or whatever. So yeah. 
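The shared caching with per-project reuse policies that Gary describes earlier in this exchange could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the class, the client names, the maximum age, and the way reuse rules are expressed as (requesting client, original client) pairs.

```python
from typing import Dict, Optional, Set, Tuple


class SharedFetchCache:
    """Hypothetical cache shared by several clients of a fetch service."""

    def __init__(self, max_age_seconds: float,
                 reuse_allowed: Set[Tuple[str, str]]) -> None:
        self.max_age_seconds = max_age_seconds
        # (requesting_client, original_client) pairs that may share content.
        self.reuse_allowed = reuse_allowed
        # url -> (fetched_at, fetched_for, body)
        self._entries: Dict[str, Tuple[float, str, bytes]] = {}

    def put(self, url: str, client: str, body: bytes, now: float) -> None:
        self._entries[url] = (now, client, body)

    def get(self, url: str, client: str, now: float) -> Optional[bytes]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        fetched_at, fetched_for, body = entry
        if now - fetched_at > self.max_age_seconds:
            return None  # too old, a fresh fetch is needed
        if client != fetched_for and (client, fetched_for) not in self.reuse_allowed:
            return None  # policy forbids reusing this copy
        return body


cache = SharedFetchCache(max_age_seconds=60.0,
                         reuse_allowed={("websearch", "news")})
cache.put("https://example.com/", "news", b"<html>...</html>", now=0.0)
```

In this toy setup, a `websearch` request ten seconds later gets the copy fetched for `news`, while an `ads` request for the same URL is told to fetch on its own.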
MARTIN SPLITT: Do I have to do that manually, or is it smart enough to try to schedule an egress point that makes sense, for instance, if something is geo-blocked? GARY ILLYES: Oh, pet peeve. MARTIN SPLITT: [LAUGHS] GARY ILLYES: Geo-blocking is interesting because, generally, we don't have the infrastructure for handling it. So the typical egress points that we have, like the IPs that start with 66, like 66, 129, blah, those are assigned country US. And if you dig into it, it's going to be Mountain View, California, which means that-- and we have this in the talk-- that we are typically crawling from the US. And when someone is geo-blocking, then our typical crawler will have an IP address from that location from California. And we will not be able to fetch. We are most likely going to get some sort of error, either an HTTP error, let's say, a 403 block, or some sort of network error. Let's say, connection timeout, like some random router that had the firewall setting to block requests outside of specific regions that would just drop the connection. It wouldn't even send back an echo. And the way we deal with this is trying to find IP addresses within our assigned pools that have a location set to a different country, and then lease those IP addresses for the crawling infrastructure. But these egress points were not designed for high capacity crawling. So they don't have the capacity to handle crawl for everyone in, let's say, Romania, or in Germany, or Switzerland. Well, Switzerland is tiny, so maybe, yes. So we are very frugal when it comes to assigning crawls to those IP addresses. But technically, we kind of can. And sometimes we do, especially if we know that the utility of that content is very high. So it's a really bad example, but let's say if enough people search for blue-eyed Martin-- MARTIN SPLITT: Oh, God. GARY ILLYES: What? MARTIN SPLITT: My eye color comes up again. All right. GARY ILLYES: That literally never comes up. 
Anyway, if someone is searching for blue-eyed Martin, and we know that there is a site in Germany that has that content, then we would make an effort to egress from Germany to be able to fetch that content if the content otherwise would be blocked or geo-fenced. And again, this was a bad example. Don't quote me. Let's say that John said this, my manager. But in theory, that's how it works. MARTIN SPLITT: All right. GARY ILLYES: It's a very, very, very bad idea to rely on this. MARTIN SPLITT: OK, so no geo-blocking for Googlebot if you reliably want to be crawlable. I see, I see. But another thing that comes to mind is, yeah, there might be people geo-blocking things, but in general, it's a lot of traffic that a crawler can generate. Are we having some sort of behavior rules or best practices for our side of things? Because I guess if I build a project and I say, hey, Google crawling infrastructure, here's my configuration, please crawl these bazillion URLs every hour, will they just do that? Or is there some sort of guidance on how our crawlers should behave? GARY ILLYES: So how our crawlers should behave. MARTIN SPLITT: Because you can overwhelm the internet, basically, right? GARY ILLYES: Right. So that kind of thing is handled at the infrastructure level. And basically, it's actually one of the reasons why we have that infrastructure, because we need to be able to force teams to not break the internet. Let's say that I'm a new engineer, and I come to Google, and I sit down, and I quickly get access to one of the machines in a data center, and I start scripting. I write a bash or a shell script, open a socket, and start streaming in data. That particular server has a 10 gigabit connection, and I go to martinsplitt.com and start streaming in the data 10 gigabit per second. I think that your server, or at least your hoster, is not going to like that. MARTIN SPLITT: Yeah. 
GARY ILLYES: So what we are doing instead is that, generally, you cannot egress directly from the servers that are running in our data centers, unless you are calling one of the fetch services, like one of the crawler infrastructure endpoints, and egress through those, because the crawler infrastructure has the capacity to say that, OK, this website, martinsplitt.com, started slowing down on repetitive fetches. So basically, from the baseline, the connection time just went up, and up, and up, and up, and up, and we have to slow down. And then it will throttle the requests that it's sending to martinsplitt.com. If it gets a 503 HTTP response, then it slows down even more, because that actually means that the server was most likely overwhelmed in some way. But then 403, 404, all those, they don't mean anything. That's just like random client error, like you send the wrong URL or something like that. So yeah, the "please don't break the internet" part, that's in the crawler infrastructure at the infrastructure level. And generally, that's not something that individual teams can control. MARTIN SPLITT: OK, so I can't screw it up with my own project. That's nice to hear. Are there any other general guidelines that the crawler level infrastructure prescribes, so to say? GARY ILLYES: There's a bunch of things that are for our own protection or our infrastructure's protection, like, for example, the infamous 15 megabyte default limit. MARTIN SPLITT: [LAUGHS] GARY ILLYES: That is set at infrastructure level. And basically, any crawler that doesn't override that setting is going to have a 15 megabyte limit. So basically, it starts fetching the bytes from the server or whatever the server is sending, and then there's an internal counter. And then when it reached 15 megabytes, then it basically stops receiving the bytes. I don't know if it closes the connection or not. I think it doesn't close the connection. It just sends a response to the server that, OK, you can stop now. 
I'm good. But then individual teams can override that, and that happens. It happens quite a bit. And for example, for Google Search, specifically for Google Search, the limit is overridden to 2 megabytes. MARTIN SPLITT: For everything? GARY ILLYES: Well, mostly everything. For example, for PDFs, it's-- I don't know-- 64 or whatever, because PDFs can-- the HTTP standard, if you export it as PDF, I think you said that. If you export it as PDF, then it's 96 megabytes or something. MARTIN SPLITT: I think so. Yeah, it was huge. I remember that. GARY ILLYES: But that means that it would overwhelm our infrastructure if we fetch the whole thing and then convert it to HTML, and blah, blah, blah, and then start processing it. It's just like it's overwhelming because it's so much data. And same goes for HTML. It's the HTML living standard. If you have 14 megabytes-- we're not going to fetch that. We are going to fetch the individual pages, because, fortunately, they also had enough brainpower to have individual pages for individual features of HTML. We can fetch those pages, but we are not going to have anything useful out of the 14 megabyte one-pager of the HTML standard. MARTIN SPLITT: Yeah. GARY ILLYES: So yeah. And other crawlers, I never worked on other crawlers, but other crawlers, I'm sure, have different settings. I could imagine, for example, that even in individual projects, it can have different settings for the same thing. For example, I can imagine that if we need to index something very fast, then the truncation limit could be 1 megabyte, for example. I don't know if that's the case, but I could imagine that to be the case, because if you need to push something through the indexing pipeline within seconds, then it's easier to deal with little data. MARTIN SPLITT: That's true. That's true. I think in general, it is useful to have cleared up this idea of crawling just being a monolithic kind of thing. 
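Two of the infrastructure-level safeguards from this part of the conversation, backing off when a host slows down or answers 503, and cutting a response off at a byte limit that clients may override, can be sketched in Python as follows. This is a rough illustration, not the real implementation: the class and function names, the multipliers, the "twice the baseline" rule, the `"*"` wildcard key, and the use of 1024 * 1024 bytes per megabyte are all assumptions. The 15, 2, and 64 megabyte figures are the ones mentioned in the episode, the last only approximately.

```python
from typing import Dict, Iterable, Optional

MB = 1024 * 1024                      # assumed definition of a megabyte
DEFAULT_MAX_BYTES = 15 * MB           # infrastructure default from the episode
# Hypothetical override table for one client: 2 MB in general, 64 MB for PDFs.
SEARCH_OVERRIDES: Dict[str, int] = {"*": 2 * MB, "application/pdf": 64 * MB}


def limit_for(content_type: str, overrides: Dict[str, int]) -> int:
    """Pick the byte limit: per-type override, client-wide override, default."""
    return overrides.get(content_type, overrides.get("*", DEFAULT_MAX_BYTES))


def read_truncated(chunks: Iterable[bytes], limit: int) -> bytes:
    """Keep receiving chunks until `limit` bytes are stored, then stop."""
    received = bytearray()
    for chunk in chunks:
        remaining = limit - len(received)
        if remaining <= 0:
            break
        received += chunk[:remaining]
    return bytes(received)


class HostThrottle:
    """Per-host delay that grows on slow connections and on 503 responses."""

    def __init__(self, base_delay: float = 1.0, max_delay: float = 60.0) -> None:
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay
        self.baseline: Optional[float] = None  # first observed connect time

    def record(self, status: int, connect_seconds: float) -> float:
        if self.baseline is None:
            self.baseline = connect_seconds
        if status == 503:
            # Server is most likely overwhelmed: slow down even more.
            self.delay = min(self.delay * 4, self.max_delay)
        elif connect_seconds > 2 * self.baseline:
            # Connection time is climbing above the baseline: slow down.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Other statuses (200, 403, 404, ...) carry no load signal here.
            self.delay = max(self.base_delay, self.delay / 2)
        return self.delay
```

For example, `read_truncated([b"abcd", b"efgh", b"ij"], 6)` keeps only the first six bytes, and a `HostThrottle` that sees a slow connection followed by a 503 ends up waiting eight times its base delay before the next request.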
It is more like a software as a service that search is-- or web search, specifically, is one client to and not a monolithic kind of thing. And as you said, configuration can change. It can even change within, let's say, Googlebot. If I'm looking for an image, we probably allow images to be larger than 2 megabytes, I guess, because images easily are larger than 2 megabytes. PDFs, we allow 64, whatever is documented. We'll link the documentation. But I think that makes perfect sense. And if you think about it as in it's a service we call with a bunch of parameters, then it makes a lot more sense to see, OK, so there's different configuration. And this configuration can change on request level, not necessarily just on-- Googlebot is always the same. Wow, all right. GARY ILLYES: That was something. MARTIN SPLITT: That was a whole bunch of stuff. Yeah, there was a lot of stuff. I think that was useful, though, and I hope that our listeners think the same way. Let us know in the comments below if you're interested in more stuff like this, or if this was useful or not. And subscribe to the podcast and tune in next time. Thanks so much, Gary, for being here with me today. GARY ILLYES: Are you a cop? MARTIN SPLITT: I'm not a cop. GARY ILLYES: Then don't tell them how to live their life. MARTIN SPLITT: I know. I'm just making suggestions here. Rude. GARY ILLYES: Fine. MARTIN SPLITT: OK, fine. GARY ILLYES: Fine. MARTIN SPLITT: Bye. GARY ILLYES: Fine. Goodbye. [MUSIC PLAYING] MARTIN SPLITT: We've been having fun with these podcast episodes. I hope you, the listener, have found them both entertaining and insightful, too. Feel free to drop us a note on LinkedIn or chat with us at one of our next events we go to. If you have any thoughts, let us know. And of course, do not forget to like and subscribe. Thank you so much for listening, and goodbye. [MUSIC PLAYING]