
How Googlebot Crawls the Web

2025-05-29 · en automatic

[Music]
Hello and welcome to a new episode of
Search Off the Record, the podcast coming
to you from the Google Search team where
we talk all about search and maybe have
some fun along the way. My name is
Martin and I'm a I'm a I'm a job title.
Oh boy, we should update these notes.
Uh, my name is Martin and I'm a search
relations engineer on the search
relations team at Google. But I'm not
alone today. With me is Gary Illyes.
Gary. Gary.
Yes. Are you here?
I'm here. Are you there? You hear
me? Okay. Good. Good. Gary, I have a
question. Okay. Um, I noticed that we
recently updated the crawler list and
someone um reached out to me and said
like they were crawled by Google. I'm
like uh I don't think that's a thing we
like a user agent we use, but apparently
that was one we used and I'm not sure
how that happened, but I I think that
was a whoopsie. But um how maybe we
should actually explain how Googlebot
works and what that part of the pipeline
is. Uh what do you think? Should we talk
about Googlebot?
We should talk about Googlebot. I
mean technically that was correct like
it they were crawled by Google.
Fair. Yeah. Coming from a Google IP
address. Yeah.
Like technically correct. And we all
know that the best kind of correct is
technically correct. It's technically
correct. That's true. But um let's talk
about it. Let's start with something
like very obscure like let's just call
it crawler first. And it's been around
for as long as Google itself. Well,
actually probably predates it
because for starting a search engine,
you need a crawler, right? For a bunch
of things, you need a crawler. Yeah.
Like when you go out pubbing or
something. Ah.
Oh, okay. No. Hm, no. If you want to
use data that is on the network, you
need something that requests it, right?
And I think isn't that what a crawler
fundamentally is or part of what a
crawler is? Crawler probably does more
than that, right? I mean, yeah, but
technically crawlers are just HTTP
clients,
right? Like much like your browser,
which is also
more advanced I guess HTTP client
because it can do more things than just
fetching data from over the network. Um,
but technically crawlers are just really
dumb browsers
maybe. I mean there's this library and
also a command line utility called curl,
right? cURL. That's a crawler, kind of, I
guess. I mean you can you you can use it
as a crawler
like I mean worst case scenario you
would write a shell script for example
to
loop through a set of URLs that you pass
on to the curl thingy and then it
fetches the stuff for you and then you
save it to disk and technically that's
what a minimal crawler is as well. I
don't know if it is in
curl, but in wget you definitely have an
option to recursively crawl something or
fetch something. So basically it will
attempt to extract URLs from the blah
that you fetched from a particular URL
and then it will attempt to access and
download those URLs and then you can set
the depth limit like how deep do you
want to go from the initial URL
uh and I think also whether you want to
stay on domain or not or something like
that but technically that's that's
already a crawler, right? Because you go
out on the internet, you find
URLs starting from one URL. You find
URLs and then eventually you will re or
fetch those URLs as well that you just
found. I think there was something um
Sergey Sergey Brin our co-founder I
think well yeah I I know that he's a
co-founder so not think but anyway uh I
think he said that
um uh this was very very early on that
if you take a very popular page, again
this was like mid '90s or end of '90s
or something like that 1990s for the
very young
listeners and uh if you take a very
popular page like the homepage of CNN or
Wall Street Journal, Fox Fox News or
whatever and you
can just follow the links where follow
means that once you found a URL on a
page, you will just fetch it again. you
will fetch that URL and then so on
basically recursively just fetch URLs
that you found on the internet and if
you start from a very popular page you
can actually crawl the whole
internet just from that one starting
point now obviously that that doesn't
hold true anymore but yeah it it was a
much simpler internet and it was much
easier to fetch back then
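The recursive approach described here (fetch a page, extract its URLs, fetch those, up to a depth limit, optionally staying on one host) can be sketched in a few lines. This is only an illustration of the idea, not how wget or any Google system is implemented; `crawl`, `LinkExtractor`, `fetch_over_http`, and the `max_depth` and `same_host_only` parameters are names invented for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_over_http(url):
    """Default fetcher: a plain HTTP GET, decoded leniently as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=2, same_host_only=True, fetch=fetch_over_http):
    """Breadth-first crawl from start_url; returns a dict of url -> html."""
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        if depth >= max_depth:
            continue  # depth limit reached: keep the page, do not follow links
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).scheme not in ("http", "https"):
                continue
            if same_host_only and urlparse(link).netloc != start_host:
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

The `fetch` argument is injectable so the traversal can be tried against an in-memory dictionary of pages before pointing it at a real site. Note that this sketch has no robots.txt handling and no rate limiting, so it is not something to run against other people's servers as written.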
but I guess if I were to write a shell
script that loops over a list of URLs
and maybe even extracts URLs from these
URLs and then keeps going. There's
probably more to it than just that
because the internet has grown quite a
lot and I can imagine that that approach
won't work today. I mean it depends what
you want to do right because if you just
want to crawl a set
of pages from a site for example like
you want to mirror your site
uh locally then technically you could do
that. I I think there are other problems
that you need to take into
consideration, especially nowadays, like
there's so much automatic traffic on the
internet on sites
that like if if you want
to be on your best behavior then you
want to at least support robots.txt,
uh, like the Robots Exclusion Protocol,
and have some system I guess I wanted to
say algorithm but I don't it's not like
a singular thing like have a system that
monitors the health of the host
and backs out like slows down if the
host is becoming unhealthy. Ah so it
adjusts crawl rate basically.
Yeah.
Mhm. because you don't necessarily want
to be
a an ass
and you want
to kind of behave, right? Mhm.
Okay. Otherwise, you are just like DoSing
the server. Yeah. I guess your
neighborhood is not going to like that
if you bring down all the websites. Hm.
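The two "best behavior" pieces mentioned here, honoring robots.txt and slowing down when a host looks unhealthy, might look roughly like this. It is a sketch under stated assumptions, not Googlebot's actual policy: `HostThrottle`, `allowed_by_robots`, the doubling and halving rule, and the thresholds are all invented for illustration. Only `urllib.robotparser` is a real standard-library module.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class HostThrottle:
    """Per-host delay that grows while the host looks unhealthy.

    Assumed policy: double the delay on HTTP 429, any 5xx status, or a slow
    response; halve it again (never below base_delay) on a healthy response.
    """

    def __init__(self, base_delay=1.0, max_delay=60.0, slow_seconds=2.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.slow_seconds = slow_seconds
        self.delay = base_delay

    def record(self, status, elapsed_seconds):
        """Update the delay from one response and return the new delay."""
        unhealthy = (
            status == 429
            or status >= 500
            or elapsed_seconds > self.slow_seconds
        )
        if unhealthy:
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            self.delay = max(self.delay / 2, self.base_delay)
        return self.delay
```

A crawler loop would call `allowed_by_robots` before each request and sleep for `throttle.delay` seconds between requests to the same host, so a struggling server sees progressively less traffic.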
So when Larry started
doing his BackRub system, that must have
eaten a lot of bandwidth.
I guess it's relative. Also, why would
anyone name anything BackRub? Like I
always had
a beef with that because it's
such a weird name to give to
even an academic search engine or
academic exploration or whatever. Like
calling it BackRub is just, it's creepy.
Why? It's a bit odd. Yeah. I mean, it's
a backlink and you get something for
backlinking and then it's like rubbing
someone else's back. But I I like you,
Gary, but please don't rub my back.
Thanks. You sure? Yeah, pretty sure. Oh,
yeah. Sorry. Okay. This is sad. No, but
I think, to your point that that
must have consumed lots of bandwidth.
Like back in the
days, one thing was that pages were way
more lightweight. True. Like way way way
more lightweight. Uh like I remember
when
um one of the first sites, this was late
90s sites, a few pages. Well, it was a
site, I guess. And the the HTML that I
put together was like 7,000 bytes. Mhm.
Like 7. That's nothing today. That's
like an image basically. Actually,
images are probably larger these days.
Yeah. Much larger. like it's so tiny
that like even if you crawl like
hundreds of thousands of them, it's not
going to make a dent in
your budget. But on the flip side,
bandwidth was much more expensive than
nowadays. Uh so they must have had some
sort of system to monitor that they're
not exhausting the very expensive
bandwidth, their bandwidth. But then you
also have to take into consideration the
site's bandwidth. Oh god. Yes, true. They
pay for it as well.
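To put the 7,000-byte figure in perspective, here is a back-of-the-envelope calculation. The page size comes from the conversation; the page count of 200,000 is an assumed stand-in for "hundreds of thousands", and `total_transfer_gb` is a helper name made up for this sketch.

```python
# 7,000 bytes per page is the figure from the conversation.
# 200,000 pages is a hypothetical stand-in for "hundreds of thousands".
PAGE_BYTES = 7_000
PAGE_COUNT = 200_000


def total_transfer_gb(page_bytes, page_count):
    """Total bytes fetched, expressed in decimal gigabytes (10**9 bytes)."""
    return page_bytes * page_count / 10**9


print(f"{total_transfer_gb(PAGE_BYTES, PAGE_COUNT):.1f} GB")  # prints "1.4 GB"
```

That total is small by today's standards, which matches the point made here; whether it was cheap at late-1990s bandwidth prices is a separate question the conversation raises but does not quantify.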
So yeah, I I think it was much easier to
crawl the internet back in those days
when BackRub was coming up online, but it
also was trickier for a different reason
and that was probably cost. Mhm.
But yeah, we had BackRub. I don't know
how fetching was done for BackRub. I
would imagine that they just had some
shell script or something that just
fetched all those pages for them, um, to
create the initial index for BackRub.
Again, I'm just making this up. I I
actually have no idea,
but it was likely not that complicated
because the web was so so so so much
smaller.
But I mean, yeah, I was reading
just before this recording "The Anatomy
of a Large-Scale Hypertextual Web Search
Engine" paper that Sergey and Larry
published at Stanford
and they are talking
about
110,000 web pages and web accessible
documents for one of the early search
engines called... Oh, that's cute. Yeah.
World Wide Web Worm, or
WWWW. This was
'94. And then there was the other one
from the guy that came up with the idea
of robots.txt, the Robots Exclusion
Protocol. I wanted to say his name too
fast. I forgot. Um, and he also had a
search engine called
WebCrawler and claimed to have indexed
about 2 million pages. Oh wow. And in
today's scale, 2 million is
still like that's cute,
right? I I think the boundary for
someone to worry about crawl budget is
what 10 million 1 million something like
that. I would like for a single site I
would say like 1 million is okay
probably
and that's pretty much like half the of
course it also
yeah but I mean you also have to like
when we talk about, like, crawl
budget or how much load we put on the
server um you also have to think about
how the site is constructed because if
you are making expensive operations to
construct the page then of course it's
going to put a much heavier load on sites
than a simple HTML site, right? True.
That's true. Like for like for example,
if you are making expensive database
calls, like that's going to cost the
server a lot. So yeah, but back then it
was much simpler anyway.
Wild and um okay, so bandwidth is a
thing. We've talked about that. Right
now Google does a lot of things that
probably need to ingest data from the
web or want to ingest data from
the web. Uh does that mean that we have
like lots of shell scripts
or how do we handle that these days?
Because I think bandwidth
needs to be taken care of across
products. No.
Um,
yeah.
But I mean, back when we only had, so we
had BackRub, or Larry and Sergey had
BackRub, right? And then they launched
Google in '96.
Yeah. '96. Then they launched, uh, Googlebot,
basically the crawler that they were
using for the search engine. Um, I think
they might have named it Googlebot in
99. Like before that it was just like
nothing although I know that from the
very beginning of Google robots.txt was
supported. Mhm. So like whatever they
were using, it already allowed site
owners to opt out from crawling. But
then we started having multiple or new
products, right? Um like we had
AdWords coming out in early 2000s. Um
and
then AdSense, 2003-ish I
think, and then you also have some
kind of fetching in Gmail which is a
2005 2006 thing. So like for example
like fetching the images because you
don't want to allow the
browser to fetch the images remotely
because then you are giving away
um
users metadata to remote
sites. So you want to proxy somehow
those image fetches in in emails. Mhm.
Um anyway, so more and more products had
to do some
fetching and for a time I think
everything was done with Googlebot,
which was just like this service that
you plugged a URL in and it fetched it
for you. And it was always just
Googlebot and you could give it a
million URLs or just five and it would
fetch it for you within the limits of
the host load that
individual sites would have.
Not a very nice design
when you
are designing for multiple products
because then people can't really tell
apart what was the fetch for, right?
Okay. Yeah. Because it's just it just
looks like Googlebot came and it
fetched something and you're like I'm
going to be on Google like early
2000s. Gary Gary very excited about
being on Google.
Um, but then it was actually a fetch uh
initiated by let's say Gmail or AdWords
or something. So I never end up in the
Google index, and not because I
was a spammer. Definitely not because of
that.
Sure. Sure. Spammy guy.
So
we introduced new
crawlers. Um
but that would also mean
that with all the engineers, software
engineers that we have and computer
scientists, every now and then someone
would come up with a brilliant idea that
oh I will just write my own crawler
because I need my own user
agent which again is not great because
then like from maintenance perspective
it's an absolute nightmare
And then different crawlers that people
wrote might have different policies
about, like, robots.txt and host load and
bandwidth usage and
whatnot. So eventually someone had to
come up with this idea that okay we will
just have this one unified system and
you can fetch with it from the internet
but you have to specify your own user
agent string when you are fetching. Mhm.
And then I think in 2006, 2006-ish, Google
AdWords comes
out with Google AdsBot. And then from
then on we started having more and more
and more, um, crawlers, like crawlers
that are not Googlebot.
And all of them behaved the same way.
I mean
yes. Yes. And that was the nice thing
about the shared infrastructure, right?
Because then you could have like a a
common way to behave on the internet for
every crawler that you send out.
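The idea of one unified fetching system, where each product must identify itself with its own user agent string but every fetch goes through the same policies, could be sketched like this. The design is a guess at the general shape, not Google's infrastructure: `SharedFetcher`, `transport`, `robots_by_host`, `per_host_budget`, and the `ExampleAdsBot` and `ExampleSearchBot` agent names are all hypothetical.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class SharedFetcher:
    """One fetch service for many products, with one shared set of policies."""

    def __init__(self, transport, robots_by_host, per_host_budget=100):
        self.transport = transport            # callable: (url, user_agent) -> body
        self.robots_by_host = robots_by_host  # dict: host -> robots.txt text
        self.per_host_budget = per_host_budget
        self.fetches_per_host = {}            # host load shared by all products

    def fetch(self, url, user_agent):
        """Return (status, body); body is None unless status is "ok"."""
        if not user_agent:
            raise ValueError("callers must supply their own user agent string")
        host = urlparse(url).netloc
        # Same robots.txt handling for every product, keyed on its user agent.
        rules = RobotFileParser()
        rules.parse(self.robots_by_host.get(host, "").splitlines())
        if not rules.can_fetch(user_agent, url):
            return ("blocked_by_robots", None)
        # Same host-load limit for every product, counted across all of them.
        used = self.fetches_per_host.get(host, 0)
        if used >= self.per_host_budget:
            return ("host_budget_exhausted", None)
        self.fetches_per_host[host] = used + 1
        return ("ok", self.transport(url, user_agent))


if __name__ == "__main__":
    robots = {"a.test": "User-agent: ExampleAdsBot\nDisallow: /\n"}
    fetcher = SharedFetcher(lambda url, ua: f"{ua} fetched {url}", robots)
    print(fetcher.fetch("http://a.test/page", "ExampleAdsBot"))
    print(fetcher.fetch("http://a.test/page", "ExampleSearchBot"))
```

Because the user agent is a required argument, a site owner reading their logs can tell which product a fetch was for, and because the budget is counted per host rather than per product, adding a new product does not multiply the load on any one site.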
All right. Okay. That
makes sense. That makes a lot of sense
because basically you now are bundling
all the, so to speak, traffic that goes
out to websites in terms of crawling
through a lens of one piece of code
which I think makes a lot of sense.
Yeah,
the one thing that I could see is
unfortunate is what if so I I see that
like all of them behave the same way
because they are all kind of robotic
agents that go out and do something for
an automated system. But what if I need
to write a piece of code that does more
or less the same but is like user
initiated? So if a user clicks on
something like I don't know I submit
something
for a review or for a specific product
where I specifically say like hey please
do this, then I'm not sure if
following robots.txt makes sense for
instance. It might make sense in some
cases, but it might not because it's not
really a robot then if I ask it to do
something.
Um, I mean that's a very philosophical
question whether it's a robot or not.
And nowadays with all the AI agents and
whatnot, there's more and more
discussion about this. But yeah, you're
right. like when a user is sitting
behind the keyboard and wants to
complete a specific action. Let's say
that they want to load something in a
spreadsheet from a specific
URL. Then you are doing a fetch on
behalf of a user. So I think you're
right that ignoring robots.txt in those
cases is the right thing
to do unless the team that is providing
that feature actually wants to follow
robots.txt. Basically, you might want to
opt. The other thing
would be
latency
because with
crawlers
you have like a massive URL database
uh from where they take the URLs that
they need to fetch. Mhm. And then you
have to sort that somehow. And then
basically, when
a user would fetch, then you add
that to that bucket, to that database. It
ends up on the bottom of the list and
then you have to wait for the earlier
added URLs to be consumed until you
reach the URL that the user just added.
And that might sometimes take weeks as
well like like sometimes it's just like
you have no time to to to fetch fast
enough or you have other limitations.
And then with user-triggered fetchers, what
you can do is
that
you more or less
ignore the signals that the sites give.
Mhm.
And basically just try to make the fetch
immediately
and for example in Search Console you
can see this when you do the live
test, site verification or the live test.
Well, actually not the live test. Site
verification. The live test is, uh,
actually a crawler. Oh, yeah. Because it
needs to Yeah, it's a it's a high
priority, but it's still a crawler.
Yeah. Fair, fair, fair. But site
verification. Yeah, that makes sense.
That's a user-triggered thing. And yeah.
Mhm. Yeah. Um and you don't have to wait
for it for hours or weeks. It it happens
almost instantane instant. Instant I
will not say that word. Instantaneously.
Yeah. that.
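The difference described here, between a crawler that works through a queue of URLs and a user-triggered fetch that happens right away, can be illustrated with a small scheduler. This is a sketch of the concept only; `CrawlScheduler`, `user_triggered_fetch`, and the numeric priority scheme are assumptions made for the example.

```python
import heapq
import itertools


class CrawlScheduler:
    """Queue of URLs to crawl: lower priority number first, then oldest first."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # tie-breaker that keeps arrival order

    def enqueue(self, url, priority=10):
        heapq.heappush(self._heap, (priority, next(self._sequence), url))

    def next_url(self):
        """Pop the next URL to crawl, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


def user_triggered_fetch(url, fetch):
    """Fetch immediately on behalf of a user, without entering the queue."""
    return fetch(url)


if __name__ == "__main__":
    scheduler = CrawlScheduler()
    scheduler.enqueue("https://example.com/old-1")
    scheduler.enqueue("https://example.com/old-2")
    # A high-priority crawl still goes through the queue, but jumps ahead.
    scheduler.enqueue("https://example.com/live-test", priority=0)
    print(scheduler.next_url())
    print(user_triggered_fetch("https://example.com/verify", lambda u: f"fetched {u}"))
```

A URL added with the default priority waits behind everything already queued, which is the latency problem described above. A high-priority item models something like the live test, which is still a crawler, while `user_triggered_fetch` models the case where the request never enters the queue at all.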
Okay. But yeah, I I think you need both
of them
because like they are different use
cases. Mhm. Really? That makes that
makes sense. That makes sense. But I in
terms of different use cases, it doesn't
sound like this is a use case specific
to Google. So I guess other people have
crawlers as well then. Yeah.
And Okay. We were not the first ones to
do this, right? Yeah, exactly. Um like
uh the World Wide Web Worm
um operated their crawlers before
Google was even conceptualized, like even
before Larry had the idea that, hey, PageRank,
we could use this to do something.
Yeah. And since then we have other
search engines and uh I guess yeah a lot
of crawlers these days. Do you see like
a change in the way that crawlers work
or
behave over the years behave? Yes. How
they crawl? There's probably not that
much to to change. But
well, I
guess back in the days we had what? Uh
HTTP/1.1. Mhm. Or, probably they were
not crawling on 0.9
because no headers and stuff, like that's
probably hard. Mhm. But anyway, uh, but
nowadays you have, uh, H2, H3. I mean, we
don't support H3 at the moment, but
eventually, why wouldn't we? And that
enables crawling much more efficiently.
Um, because you can stream stuff. Stream
meaning that you open one connection and
then you just do multiple things. Do
multiple things on that one connection
instead of um opening a bunch of
connections. So yeah, like the the way
the HTTP
clients work under the hood that changes
but technically crawling doesn't
actually change. Okay. Um and then how
different
companies police
their, uh, or set policies for their
crawlers that of course differs greatly.
And if you are involved in
discussions at the IETF, for example, the
Internet Engineering Task Force, uh, about
crawler behavior then you can see that
some publishers are complaining that
crawler X or crawler B or crawler Y was
doing something that they would have
considered not
nice. Yeah. So yeah, like the
policies might differ between crawler
operators, but in general, I think
the well-behaved crawlers,
they would all try to honor robots.txt
or the Robots Exclusion Protocol in general
and pay some attention to the signals
that sites give about their own load uh
or their servers' load, um, and back out
when they can. And then you also have
the, what are they called, the adversarial
crawlers,
like malware scanners and privacy
scanners and whatnot. And then you would
probably need a different kind of policy
for them because they are doing
something
that they want to hide. Not for
malicious reasons, but
because malware distributors
would probably try to hide their malware
if they knew that a malware scanner is
coming in. Let's say, okay, I was trying
to come up with another example, but I
can't. Anyway, yeah. What else do you
have? Okay. Well, um I think Oh, and
then and then you have the bad actors,
right? They are just like, I just want
to crawl half of the internet in 25
seconds. Yeah. They might overpower your
server and that is not a very nice thing
to happen. Huh.
Yeah.
Okay, so we have the need to ingest data
from the web and then you build
infrastructure to do that because it's
not a trivial thing and at Google we
have kind of like shared infrastructure
for that. That's pretty cool and
we try to be a nice citizen of the
web. So hopefully other crawlers will
continue to do that rather than try to
ingest the whole internet in 25 seconds.
That's that sounds fun, but
uh I don't think that's feasible in the
long run. Also, for people operating
websites, you might just have like
random traffic spikes and these traffic
spikes might still cost you some money.
Yeah, I mean that that's one thing that
uh we've been doing last year, right?
Like we were trying to reduce our
footprint on the internet. Mhm. Um, and of
course it's not helping that then like
new products are launching or new uh
like AI products that do fetching for
various reasons and then basically you
saved seven bytes from each request that
you make and then this new product will
add back eight. But, like,
the internet can handle
the load from crawlers. Like,
I firmly believe that, this will be
controversial and I will get yelled at
on the internet for this but it's not
crawling that is eating up the the
resources. it's indexing and potentially
serving or what you are doing with the
data when you are processing that data
that you fetch. That's um what's um
what's expensive and resource intensive.
So yeah, I will stop there. Okay. Before
I get in more trouble.
Okay, before I put you in more trouble,
thanks a lot Gary for explaining uh
crawlers to me. And um that's the past
and present for crawlers, but what's the
future going to look like? Are we
working on something or HTTP/3 is
something that we will eventually get
around to I guess. But what else? Yeah,
I mean H3 is not going to solve the
bigger problems, I don't think. Um like
what like we just get the trailers, but
you get the trailers with H2 as well. So
it's like like it's not going to fix our
bigger problems. I So
well what do you think are the bigger
problems first before we talk about
solutions? The web is getting congested,
and it's because, like, everyone
and my, uh, grandmother is launching a
crawler or fetchers or whatever. We will
have more automatic traffic from AI
agents, for example, um, and other AI
shenanigans. So basically the web is
going to be more congested but it's not
something that the web cannot
handle like the the web is designed to
be able to to handle all that uh traffic
even if it's automatic, and I
would say that it's in good hands.
If they see that there's some
problems with load and whatever
then they will just come up with some
new technologies that
will fix that um
or reduce that that issue.
What
else? I really like what Common Crawl
is doing because they release data sets.
So basically they have their crawler and
then they crawl some parts of the
internet and then they release that as a
data set. So you don't have to crawl
yourself and I think that's very nice
because then you basically have the same
thing that we have internally basically
a single infrastructure doing the
fetching, respecting robots.txt and host
load and whatnot. Um, and then you can
just consume the data. Of course,
internally you can't just consume
the data. That's different. Like you
still have to do fetches, but at least
the Robots Exclusion Protocol policies
and the host load are enforced for
the crawl job that you set up. Mhm. Um,
I don't know if we need more of these,
but yeah, I I thought it's a good idea
and it's a nice idea. Okay. All right.
Well, Common Crawl then. Uh, that's
something that I don't think I looked
into. I should probably have a look at
that. Well, in that case, thanks a lot,
Gary, for giving me a journey through
the world of crawling. And um, I do hope
that you all out there enjoyed this
episode and had a good time. If so, let
us know in the comments. Like and
subscribe to hear more of our episodes.
And also tell us if you want to have
a specific episode for a specific topic.
So, with that again, thanks Gary and um
enjoy your time listening to this out
there and uh bye-bye listeners. Goodbye.
Oh god, why? Bye-bye, Gary.
We've been having fun with these podcast
episodes, and we hope that you, the
listener, have found them both
entertaining and insightful, too. Feel
free to drop us a note on LinkedIn or
chat with us at one of the next events
that we go to if you have any thoughts.
And of course, don't forget to like and
subscribe. Thank you and goodbye.
[Music]