Google crawlers behind the scenes

2026-03-12 · en-j3PyPqV-e1s manual

[MUSIC PLAYING]
MARTIN SPLITT: Hello, and welcome to the latest episode
of Search Off the Record.
In this show, we from the Search Relations team here at Google
are trying to give you a glimpse of what's
happening behind the scenes.
And with me today is Gary.
Hello, Gary.
GARY ILLYES: Hello.
MARTIN SPLITT: How are you doing?
GARY ILLYES: I'm great.
MARTIN SPLITT: Fantastic.
Let me change that.
GARY ILLYES: OK.
MARTIN SPLITT: I want to talk about crawling.
GARY ILLYES: Oh no, no, no, no, no, no, no, no, no, no, no.
MARTIN SPLITT: Actually, no.
I want to talk about crawlers because I am wondering
if we ever discussed what exactly our crawling infrastructure
looks like.
Because people keep talking about Googlebot
as if it's like a sort of almost a living
thing, or at least a specific program.
But there's no Googlebot EXE that you double click on,
and then it launches or something.
GARY ILLYES: There's not?
MARTIN SPLITT: It works a bit differently.
GARY ILLYES: How?
What?
MARTIN SPLITT: You taught me that.
Yeah.
GARY ILLYES: Well, you're correct.
MARTIN SPLITT: Do we want to elaborate a little bit on that?
So how can I imagine Googlebot?
What does our crawling infrastructure roughly
look like?
GARY ILLYES: Calling it Googlebot, that's a misnomer.
And it's something that back in the days, perhaps early 2000s,
it worked well because back then,
we probably had one crawler because we had one product.
But then soon after, another product came out.
I think that was AdWords.
And then we started having more crawlers,
and then more products came out, and then more crawlers, and then
more crawlers.
But the Googlebot name, that somehow stuck.
And generally, when we were talking
about our crawling infrastructure in general,
then we tended to call it Googlebot,
but that was wildly inaccurate, because Googlebot
was just one thing that was communicating with our crawler
infrastructure.
I don't know if that makes sense.
MARTIN SPLITT: How can I imagine that?
What do you mean by communicating with our crawler
infrastructure?
Googlebot is our crawler infrastructure.
No?
GARY ILLYES: Well, yeah, that's what I've been saying
for the past three minutes.
MARTIN SPLITT: Yeah, but I can't picture it still.
GARY ILLYES: So Googlebot is not our crawler infrastructure.
Our crawler infrastructure doesn't have an external name.
It has an internal name.
It doesn't matter what it is.
Let's call it Jack.
And I don't know how to put it.
It's software as a service, if you like.
MARTIN SPLITT: Oh, OK.
What's that?
SaaS.
Right?
GARY ILLYES: Yeah.
And then so Jack has API endpoints, so to say.
And then you can call those API endpoints to do a fetch
from the internet.
MARTIN SPLITT: Right.
GARY ILLYES: And then when you do those API calls, then
you also need to specify some parameters like,
how long are you willing to wait for the bytes to come back?
Or what is your user agent that you want to send?
What is the robots.txt product token that you want to obey?
And all these parameters.
And we do set a default parameter
for most of these things, not all of them,
but most of these things.
So you can generally omit them, which
makes these calls simpler, I guess,
because you don't have to specify all the stuff.
But otherwise, it's really just an API call
to something in the Cloud or on some random data center.
And then that will perform a fetch for you
as a software developer or a product or whatever.
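
Illustration (not part of the episode): a minimal sketch of what "an API call with mostly-defaulted parameters" could look like from a caller's side. The class name, field names, and default values are all hypothetical; the episode only says that a timeout, a user agent, and a robots.txt product token are among the parameters and that most of them have defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchRequest:
    """Parameters a caller hands to a hypothetical central fetch service."""

    url: str
    # Every default below is invented for illustration only.
    timeout_seconds: float = 30.0             # how long to wait for bytes
    user_agent: str = "ExampleBot/1.0"        # user agent string to send
    robots_product_token: str = "ExampleBot"  # robots.txt token to obey


# A caller that is happy with the defaults only names the URL.
simple = FetchRequest("https://example.com/")

# A caller with special needs overrides just the parameters it cares about.
custom = FetchRequest(
    "https://example.com/feed",
    timeout_seconds=5.0,
    user_agent="ExampleNewsBot/1.0",
    robots_product_token="ExampleNewsBot",
)
```

The point of the defaults is the one made above: most callers can omit most parameters, so the common call stays short.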
MARTIN SPLITT: That's really nice.
So I guess there's also a team that manages it,
because effectively, what I'm doing is
I'm outsourcing it to someone else.
GARY ILLYES: Yeah.
MARTIN SPLITT: To make all these decisions for me.
GARY ILLYES: So this product--
because we can call it a product at this point,
even if it's internal-- this has been around
for a very, very, very, very long time.
So technically, it's been around since Google existed.
There were some changes to it because the original version
was more or less just a wget that
was running on some random engineer's workstation.
So if we think back to 1998 or '99, and then
as more products came out, the more staffing
it needed, for example, more resources it needed.
And of course, we needed to re-architect the whole thing
to enable teams to call this service.
But in essence, it's always been doing the same thing.
It's basically, you tell it, fetch something
from the internet without breaking the internet,
and then it will do that if the restrictions on the site
allow it.
That's it.
If I wanted to put it in one sentence, that would be it.
MARTIN SPLITT: OK.
So basically, I hand over a bunch of configuration
and say, part of that configuration
is the bunch of URLs that I want crawled.
And then I hand that over to the servers, and then
they come back with something to me.
Right?
GARY ILLYES: Yeah, pretty much.
MARTIN SPLITT: And that something probably
is the HTTP response, and the headers,
and the body, and maybe some additional metadata.
Cool.
So basically, Googlebot is just a piece of this configuration
that I hand over.
A name, basically.
GARY ILLYES: Say that again.
Sorry.
MARTIN SPLITT: So Googlebot is not really a program,
but a piece of this configuration
that I hand over, basically, just
a name of the configuration, so to speak.
GARY ILLYES: Well, it is one of the callers of the SaaS.
MARTIN SPLITT: OK.
GARY ILLYES: It's not even part of the configuration.
It's just the name that one particular team
is using for their fetches that are sent to this central SaaS.
MARTIN SPLITT: So basically, one of the clients.
GARY ILLYES: One of the clients.
Yeah, exactly.
Exactly.
MARTIN SPLITT: Well, that suggests
that there's other clients.
GARY ILLYES: Yeah, sure.
We try to document a big chunk of them,
but Google is a big company.
So there's lots of teams that want to fetch from the internet.
So there's lots of crawlers, lots of named crawlers, which
means that we would need to document dozens, if not hundreds
of different crawlers, or special crawlers, or fetchers.
And on a simple HTML page, that's kind of infeasible.
So we try to draw a line and say that if the crawler is really
tiny, meaning that it doesn't fetch
too much from the internet, then we try not to document it.
Because the real estate on the crawler site,
on developers.google.com/crawlers
is actually quite valuable.
We might try to deal with that differently.
But for the moment, basically, just the major crawlers,
and special crawlers and fetchers
are documented, because quite literally,
because of lack of space.
MARTIN SPLITT: You say fetchers and crawlers.
What's the difference?
GARY ILLYES: So the simplest way to explain
it is that crawlers are doing work in batch,
and then fetchers do work on an individual URL basis,
meaning that you give a URL to a fetcher,
and then it will fetch just one URL.
You cannot give it a list of URLs to fetch.
MARTIN SPLITT: OK.
GARY ILLYES: And then for crawlers, it's
a constant stream, usually, of URLs.
And it's running continuously for your team
and fetching for your team from the internet.
And internally, we also have this policy
that fetchers need to be, in some way, user controlled.
So basically, there's someone on the other end
who's waiting for the response of the fetcher.
MARTIN SPLITT: OK, yeah.
GARY ILLYES: While with crawlers, it's like,
just do it when you have the time.
MARTIN SPLITT: Right, so if there is an automatic system
that consumes the response and then does something whenever
it's available, then we can, obviously,
treat that differently than if someone clicks a button
and waits for a result.
GARY ILLYES: Right.
MARTIN SPLITT: OK, got it.
And that's the difference between fetcher versus crawler.
GARY ILLYES: I think so, yeah.
MARTIN SPLITT: OK, cool.
GARY ILLYES: I'm pretty sure that there's more differences.
For example, the IP ranges that they are
fetching from are different.
But otherwise, it's pretty much the same infrastructure,
more or less.
It's just performing different or performing the same tasks
differently.
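
Illustration (not part of the episode): one way to picture the fetcher/crawler distinction described above. Both functions use the same underlying fetch callable, standing in for the shared infrastructure. The function names and the fake fetch function are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Tuple

FetchFn = Callable[[str], bytes]


def run_fetcher(url: str, fetch: FetchFn) -> bytes:
    """Fetcher: exactly one URL, and the caller waits for the result."""
    return fetch(url)


def run_crawler(urls: Iterable[str], fetch: FetchFn) -> Iterator[Tuple[str, bytes]]:
    """Crawler: works through a stream of URLs and yields results as they come."""
    for url in urls:
        yield url, fetch(url)


def fake_fetch(url: str) -> bytes:
    """Stand-in for the real fetch service; performs no network access."""
    return ("body of " + url).encode("utf-8")


single = run_fetcher("https://example.com/a", fake_fetch)
batch = list(run_crawler(["https://example.com/a", "https://example.com/b"], fake_fetch))
```

The fetcher takes a single URL by construction, which mirrors the rule that you cannot hand it a list.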
MARTIN SPLITT: Right.
So I guess if we have documented at least like the major crawlers
and maybe even fetchers, then people probably know about them.
But you said you only document the major ones.
So if I were to start a new project
and I needed to somehow have people type in the URL
and click a button, then you wouldn't necessarily
document that specific project if it's small enough, right?
GARY ILLYES: Yeah, exactly.
Basically, the trigger for us documenting it,
I spent way too much time coming up with, basically,
something like SQL queries to trigger alerts for us internally
when a crawler or fetcher passes a certain threshold of number
of fetches per day.
And if that alert triggers internally,
then we would get a bug opened, an issue opened internally
that would say that, hey, there is a new large crawler in town,
and perhaps you want to document it.
And then we would go and look at the properties of that crawler,
what it's doing, why it's doing it.
We would check with the team to ensure that they are not
doing something accidentally, because we also had instances
where we got a complaint about the crawler doing something
on a site, and then we looked at it,
and the team was like, no, that crawler is unlaunched.
We unlaunched it two years ago.
That's not possible.
And then we were looking at our logs, and yeah, it was fetching.
And then we tracked it down that there was some random job
that they forgot to turn down when the project was sunset,
that they forgot to turn off that job,
and it kept fetching from the internet for no good reason.
But nowadays, that's really rare because we
have all these monitoring and all these checks in place
to ensure that the fetches that we are doing
or crawls that we are doing are actually--
or they actually have some utility internally,
not just randomly fetching.
And on the utility note, there's also really aggressive
caching on our side internally.
And that's regardless of the HTTP caching mechanisms.
So for example, if, let's say, Google News fetched something
10 seconds ago, then does it make any sense
to go out with another crawler who's
supplying data to web search and fetch that thing again?
It probably doesn't.
So basically, we just hand it the copy
that we got 10 seconds ago to avoid these things.
But then there's also tricky things
where different projects might have
different policies about reuse of content fetched for something
else.
Let's say that something random, like AdWords cannot reuse
content that was fetched for a web search.
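
Illustration (not part of the episode): a small sketch of a shared fetch cache with per-client reuse rules, as described above. The class, the client names, the 60-second freshness window, and the policy table are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass
class CachedCopy:
    body: bytes
    fetched_at: float  # timestamp in seconds
    fetched_for: str   # client that triggered the original fetch


class SharedFetchCache:
    """Hands out a recent copy instead of refetching, if policy allows reuse."""

    def __init__(self, fetch: Callable[[str], bytes], max_age_seconds: float,
                 reuse_policy: Dict[str, Set[str]]) -> None:
        self._fetch = fetch
        self._max_age = max_age_seconds
        # Maps a client to the other clients whose fetched content it may reuse.
        self._reuse_policy = reuse_policy
        self._copies: Dict[str, CachedCopy] = {}

    def get(self, url: str, client: str, now: float) -> bytes:
        copy = self._copies.get(url)
        allowed = self._reuse_policy.get(client, set()) | {client}
        if (copy is not None
                and now - copy.fetched_at <= self._max_age
                and copy.fetched_for in allowed):
            return copy.body
        body = self._fetch(url)
        self._copies[url] = CachedCopy(body, now, client)
        return body


calls = []


def counting_fetch(url: str) -> bytes:
    """Fake fetch that records how often the 'network' was hit."""
    calls.append(url)
    return b"page"


cache = SharedFetchCache(counting_fetch, 60.0, {"websearch": {"news"}})
cache.get("https://example.com/", "news", now=0.0)        # real fetch
cache.get("https://example.com/", "websearch", now=10.0)  # reuses the news copy
cache.get("https://example.com/", "ads", now=11.0)        # policy forces a refetch
```

Here "websearch" is allowed to reuse what "news" fetched ten seconds earlier, while "ads" has no reuse permission and triggers its own fetch.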
MARTIN SPLITT: That makes sense.
And you said something about a job that was still running.
So I'm guessing this infrastructure
is huge and has to consume a lot of URLs every day.
So I'm guessing we're not running this
from your computer on your desk.
GARY ILLYES: Right.
So this is going a lot into our infrastructure.
But imagine the same way that Google Cloud has those runner
instances, or whatever they are called.
We would have something similar internally.
So basically, I can bring up a job
on some remote server in some random data center in Atlanta
and run my job there.
And the job would be a C++ program that I compiled
into a bin file and run it from there, or run it as a bin file.
MARTIN SPLITT: OK.
GARY ILLYES: But within that program that I compiled,
I would make the API calls.
So basically, I can instruct that program
to address an API endpoint of that SaaS crawler infrastructure
thingy, and instruct a crawl, or set up a crawl, or whatever.
So yeah.
MARTIN SPLITT: Do I have to do that manually,
or is it smart enough to try to schedule an egress point that
makes sense, for instance, if something is geo-blocked?
GARY ILLYES: Oh, pet peeve.
MARTIN SPLITT: [LAUGHS]
GARY ILLYES: Geo-blocking is interesting
because, generally, we don't have the infrastructure
for handling it.
So the typical egress points that we have,
like the IPs that start with 66, like 66, 129, blah,
those are assigned to the country US.
And if you dig into it, it's going
to be Mountain View, California, which means that--
and we have this in the talk-- that we are typically
crawling from the US.
And when someone is geo-blocking,
then our typical crawler will have an IP address
from that location from California.
And we will not be able to fetch.
We are most likely going to get some sort of error,
either an HTTP error, let's say, a 403 block,
or some sort of network error.
Let's say, connection timeout, like some random router
that had the firewall setting to block requests
outside of specific regions that would just drop the connection.
It wouldn't even send back an echo.
And the way we deal with this is trying
to find IP addresses within our assigned pools that
have a location set to a different country,
and then lease those IP addresses
for the crawling infrastructure.
But these egress points were not designed
for high capacity crawling.
So they don't have the capacity to handle crawl
for everyone in, let's say, Romania, or in Germany,
or Switzerland.
Well, Switzerland is tiny, so maybe, yes.
So we are very frugal when it comes to assigning crawls
to those IP addresses.
But technically, we kind of can.
And sometimes we do, especially if we
know that the utility of that content is very high.
So it's a really bad example, but let's say if enough people
search for blue-eyed Martin--
MARTIN SPLITT: Oh, God.
GARY ILLYES: What?
MARTIN SPLITT: My eye color comes up again.
All right.
GARY ILLYES: That literally never comes up.
Anyway, if someone is searching for blue-eyed Martin,
and we know that there is a site in Germany that
has that content, then we would make an effort
to egress from Germany to be able to fetch
that content if the content otherwise
would be blocked or geo-fenced.
And again, this was a bad example.
Don't quote me.
Let's say that John said this, my manager.
But in theory, that's how it works.
MARTIN SPLITT: All right.
GARY ILLYES: It's a very, very, very bad idea to rely on this.
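
Illustration (not part of the episode): a sketch of the "frugal" egress choice described above, where a scarce regional pool is used only for geo-blocked content of high utility. The pool type, the capacity counter, the utility score, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EgressPool:
    country: str
    remaining_fetches: int  # hypothetical capacity budget for this pool


def choose_egress(default: EgressPool, regional: Optional[EgressPool],
                  geo_blocked: bool, utility: float,
                  min_utility: float = 0.8) -> EgressPool:
    """Pick the egress pool for one fetch."""
    if not geo_blocked:
        return default
    if (regional is not None
            and regional.remaining_fetches > 0
            and utility >= min_utility):
        regional.remaining_fetches -= 1  # spend scarce regional capacity
        return regional
    # Otherwise stay on the default pool; the fetch will likely be blocked.
    return default


us = EgressPool("US", remaining_fetches=1_000_000)
de = EgressPool("DE", remaining_fetches=1)

first = choose_egress(us, de, geo_blocked=True, utility=0.9)   # uses DE
second = choose_egress(us, de, geo_blocked=True, utility=0.9)  # DE is exhausted
```

Once the regional budget is spent, later geo-blocked fetches fall back to the default pool, which is why relying on this is a bad idea.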
MARTIN SPLITT: OK, so no geo-blocking for Googlebot
if you reliably want to be crawlable.
I see, I see.
But another thing that comes to mind
is, yeah, there might be people geo-blocking things,
but in general, it's a lot of traffic
that a crawler can generate.
Are we having some sort of behavior rules
or best practices for our side of things?
Because I guess if I build a project
and I say, hey, Google crawling infrastructure,
here's my configuration, please crawl
these bazillion URLs every hour, will they just do that?
Or is there some sort of guidance on how
our crawlers should behave?
GARY ILLYES: So how our crawlers should behave.
MARTIN SPLITT: Because you can overwhelm the internet,
basically, right?
GARY ILLYES: Right.
So that kind of thing is handled at the infrastructure level.
And basically, it's actually one of the reasons
why we have that infrastructure, because we
need to be able to force teams to not break the internet.
Let's say that I'm a new engineer, and I come to Google,
and I sit down, and I quickly get access
to one of the machines in a data center, and I start scripting.
I write a bash or a shell script, open a socket,
and start streaming in data.
That particular server has a 10 gigabit connection,
and I go to martinsplitt.com and start streaming in the data 10
gigabit per second.
I think that your server, or at least your hoster,
is not going to like that.
MARTIN SPLITT: Yeah.
GARY ILLYES: So what we are doing instead is that,
generally, you cannot egress directly from the servers that
are running in our data centers, unless you are calling one
of the fetch services, like one of the crawler infrastructure
endpoints, and egress through those,
because the crawler infrastructure has the capacity
to say that, OK, this website, martinsplitt.com,
started slowing down on repetitive fetches.
So basically, from the baseline, the connection time just
went up, and up, and up, and up, and up,
and we have to slow down.
And then it will throttle the requests
that it's sending to martinsplitt.com.
If it gets a 503 HTTP response, then it slows down even more,
because that actually means that the server was most likely
overwhelmed in some way.
But then 403, 404, all those, they don't mean anything.
That's just like random client error,
like you send the wrong URL or something like that.
So yeah, the "please don't break the internet" part,
that's in the crawler infrastructure
at the infrastructure level.
And generally, that's not something
that individual teams can control.
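
Illustration (not part of the episode): a sketch of per-host throttling driven by the signals mentioned above. Rising connection time slows fetching, a 503 slows it more, and 403 or 404 responses are ignored as load signals. The class name, the multipliers, and the "twice the baseline" rule are assumptions, and recovery back to the base delay is left out.

```python
class HostThrottle:
    """Tracks the delay to keep between requests to a single host."""

    def __init__(self, base_delay: float = 1.0, max_delay: float = 60.0) -> None:
        self.delay = base_delay
        self.max_delay = max_delay
        self.baseline_connect = None  # first observed connection time

    def record(self, status: int, connect_seconds: float) -> float:
        """Update the delay after one response and return the new delay."""
        if self.baseline_connect is None:
            self.baseline_connect = connect_seconds
        if status == 503:
            # Server is most likely overwhelmed: back off hard.
            self.delay = min(self.delay * 4, self.max_delay)
        elif connect_seconds > 2 * self.baseline_connect:
            # Connections are getting slower than the baseline: back off.
            self.delay = min(self.delay * 2, self.max_delay)
        # Other statuses such as 403 or 404 are client errors and say
        # nothing about server load, so the delay stays unchanged.
        return self.delay


throttle = HostThrottle(base_delay=1.0)
after_ok = throttle.record(200, 0.1)    # sets the baseline
after_404 = throttle.record(404, 0.1)   # client error, no change
after_slow = throttle.record(200, 0.5)  # slower than 2x baseline
after_503 = throttle.record(503, 0.1)   # overloaded server
```

Keeping this logic inside the shared service, rather than in each caller, matches the point that individual teams cannot switch it off.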
MARTIN SPLITT: OK, so I can't screw it up with my own project.
That's nice to hear.
Are there any other general guidelines
that the crawler level infrastructure
prescribes, so to say?
GARY ILLYES: There's a bunch of things
that are for our own protection or our infrastructure's
protection, like, for example, the infamous 15 megabyte default
limit.
MARTIN SPLITT: [LAUGHS]
GARY ILLYES: That is set at infrastructure level.
And basically, any crawler that doesn't override that setting
is going to have a 15 megabyte limit.
So basically, it starts fetching the bytes from the server
or whatever the server is sending,
and then there's an internal counter.
And then when it reaches 15 megabytes,
then it basically stops receiving the bytes.
I don't know if it closes the connection or not.
I think it doesn't close the connection.
It just sends a response to the server
that, OK, you can stop now.
I'm good.
But then individual teams can override that, and that happens.
It happens quite a bit.
And for example, for Google Search,
specifically for Google Search, the limit
is overridden to 2 megabytes.
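
Illustration (not part of the episode): a sketch of a byte counter that stops reading once a limit is reached, with a default that a caller can override. The function name and chunk size are hypothetical, and treating a megabyte as 1024 * 1024 bytes is an assumption.

```python
import io
from typing import BinaryIO

DEFAULT_MAX_BYTES = 15 * 1024 * 1024  # assumed meaning of "15 megabytes"


def read_truncated(stream: BinaryIO, max_bytes: int = DEFAULT_MAX_BYTES,
                   chunk_size: int = 64 * 1024) -> bytes:
    """Read from stream until it ends or max_bytes have been received."""
    received = bytearray()
    while len(received) < max_bytes:
        # Never ask for more than the remaining budget.
        chunk = stream.read(min(chunk_size, max_bytes - len(received)))
        if not chunk:
            break  # the server sent everything it had
        received.extend(chunk)
    return bytes(received)


# A 3 MB response read by a caller that overrides the limit to 2 MB.
big_body = io.BytesIO(b"x" * (3 * 1024 * 1024))
truncated = read_truncated(big_body, max_bytes=2 * 1024 * 1024)

# A small response read with the default limit comes back whole.
small = read_truncated(io.BytesIO(b"<html></html>"))
```

The override is just a different `max_bytes` argument, which is how one client can use 2 megabytes while the default stays at 15.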
MARTIN SPLITT: For everything?
GARY ILLYES: Well, mostly everything.
For example, for PDFs, it's--
I don't know-- 64 or whatever, because PDFs
can-- the HTTP standard, if you export it as PDF,
I think you said that.
If you export it as PDF, then it's 96 megabytes or something.
MARTIN SPLITT: I think so.
Yeah, it was huge.
I remember that.
GARY ILLYES: But that means that it
would overwhelm our infrastructure if we fetch
the whole thing and then convert it to HTML,
and blah, blah, blah, and then start processing it.
It's just like it's overwhelming because it's so much data.
And same goes for HTML.
It's the HTML living standard.
If you have 14 megabytes--
we're not going to fetch that.
We are going to fetch the individual pages, because,
fortunately, they also had enough brainpower
to have individual pages for individual features of HTML.
We can fetch those pages, but we are not
going to have anything useful out of the 14 megabyte
one-pager of the HTML standard.
MARTIN SPLITT: Yeah.
GARY ILLYES: So yeah.
And other crawlers, I never worked on other crawlers,
but other crawlers, I'm sure, have different settings.
I could imagine, for example, that even
in individual projects, it can have different settings
for the same thing.
For example, I can imagine that if we need to index something
very fast, then the truncation limit could
be 1 megabyte, for example.
I don't know if that's the case, but I
could imagine that to be the case, because if you need
to push something through the indexing
pipeline within seconds, then it's easier
to deal with little data.
MARTIN SPLITT: That's true.
That's true.
I think in general, it is useful to have cleared up
this idea of crawling just being a monolithic kind of thing.
It is more like a software as a service that search is--
or web search, specifically, is one client to and not
a monolithic kind of thing.
And as you said, configuration can change.
It can even change within, let's say, Googlebot.
If I'm looking for an image, we probably
allow images to be larger than 2 megabytes, I guess,
because images easily are larger than 2 megabytes.
PDFs, we allow 64, whatever is documented.
We'll link the documentation.
But I think that makes perfect sense.
And if you think about it as in it's
a service we call with a bunch of parameters,
then it makes a lot more sense to see, OK, so there's
different configuration.
And this configuration can change on request level,
not necessarily just on--
Googlebot is always the same.
Wow, all right.
GARY ILLYES: That was something.
MARTIN SPLITT: That was a whole bunch of stuff.
Yeah, there was a lot of stuff.
I think that was useful, though, and I
hope that our listeners think the same way.
Let us know in the comments below if you're
interested in more stuff like this, or if this was useful
or not.
And subscribe to the podcast and tune in next time.
Thanks so much, Gary, for being here with me today.
GARY ILLYES: Are you a cop?
MARTIN SPLITT: I'm not a cop.
GARY ILLYES: Then don't tell them how to live their life.
MARTIN SPLITT: I know.
I'm just making suggestions here.
Rude.
GARY ILLYES: Fine.
MARTIN SPLITT: OK, fine.
GARY ILLYES: Fine.
MARTIN SPLITT: Bye.
GARY ILLYES: Fine.
Goodbye.
[MUSIC PLAYING]
MARTIN SPLITT: We've been having fun with these podcast episodes.
I hope you, the listener, have found
them both entertaining and insightful, too.
Feel free to drop us a note on LinkedIn
or chat with us at one of the next events we go to.
If you have any thoughts, let us know.
And of course, do not forget to like and subscribe.
Thank you so much for listening, and goodbye.
[MUSIC PLAYING]