
Crawling smarter, not harder | Search Off the Record

2024-08-08 · en automatic

[Music]
hello and welcome to another episode of
search off the record a podcast coming
to you from the Google search team my
name is John and today we have Lizzy and
Gary say
hi don't tell us what to
do yeah
hi thank you thank you so nice to have
you here last time we talked with Dave
Smart and apparently we also talked
about
crawling but I was not
here for the listeners John is trying to
figure out Lizzy's notes because Lizzy
started reading this or wanted to read
this and then John was like no I do it
he would not let me do the intro so now
we are left with this intro which is
very confusing okay go forward Lizzy
okay so this is supposed to be a part
two for people who were not following
along I guess uh we had episode one with
Dave Smart uh to talk about what is
crawling and we sort of did like a
background uh I don't know set the stage
episode and since then Gary has posted
too many times about crawling on
LinkedIn so we thought maybe we could
talk about that what what do you mean a
why was I not told that Dave was part one
two what does it mean I'm posting too
much or too many things about crawling
what is too two two to T wo your English
construction is
weird I heard that you posted about
crawling but I actually didn't yes I
heard you told me that you posted about
crawling uh on LinkedIn and you got some
surprising responses from people
uh surprising in more senses than one
are you sure I'm pretty sure it was you
[Laughter]
oh I also
heard that this year you were going to
work on crawling oh is that was it is
that a true statement yeah at the
beginning of the year you thought maybe
you would do something with crawling
well yeah um and I mean we already done
some things I think but in general yes I
think should do more on crawling in the
sense that we should make it
more well we should crawl somehow less
which would mean that we crawl more I
think you did post about that on
LinkedIn and then Barry post cross
posted that Google wants to crawl less
and then the internet broke because they
were like what Barry from this is like
Barry from Search Engine Roundtable right yes
Barry
Schwartz oh cool I mean it's it's
something I I hear from a lot where they
think well Google usually crawls more
when he thinks my site is
good Google the googlebot they slash
them googlebot accepts all pronouns okay
then then that was fine I'm sorry are
you a spokesperson for Googlebot
yes okay so so people
thought that googlebot usually crawls
more when Googlebot thinks that
something is good so the assumption is
that you can turn it around as well and
be like well I will push googlebot to
crawl more and then googlebot will think
my site is actually
good which no I mean is that like a
chicken and an egg thing though what
like does your site have to be good
first for Google to then crawl it more
or just Google crawling more then means
your site is good I don't know Gary what
what do you think why me if if I can
make googlebot crawl my site more
because of my fancy robots.txt
file does that mean that my site will be
better in search I mean why would it I mean
it sounds like people are using this as
like a proxy like if Google is
interested in my site more often and
that means that stuff is good but it
could also mean that there's an infinite
space on the site so it's like it like
it's it's not oh that's a cool hack I'll
put a calendar script on my site no sit
down please has this always been a thing
that people think that more crawling is
equals good I think so I mean in one of
the presentations that we uh keep doing
search Central live events that is
actually about myth busting and it has
at least one or two questions about
crawling and then it's like oh Google is
crawling my site a lot so my site must be
very good and like n not really like it
can mean many things but generally if a
site is of or the content of a site is
of high quality and it's helpful and
people like it in general then Googlebot
well Google tends to crawl more from
from that site but it can also mean that
I don't know the site was hacked and
then there's a bunch of new URLs that
Googlebot gets excited about and then
it goes out and is crawling like
crazy or we discover John's calendar
script and um then we try to crawl every
single URL for every day until
20177 so it's it it can mean other
things as well than just quality but
then on the on the flip side if we are
not crawling much or we are gradually
slowing down with with crawling that
might be uh a sign
of uh low quality content or that
we
rethought the quality of the site
because it's but what if it's not
changing what if
it's what like the content so we go and
crawl it and they haven't made a change
why would we need to go crawl that often
again if they're not making a lot of
changes I mean we have to go back and
see if it if it changed right but if we
notice that it's not changing do we then
back but would that result in like
overtime less probably but I don't know
John has a site that he hasn't
updated in
like
72
years um I'm looking at the logs here um
and um he could say it still gets
crawled yeah I think it's challenging
with with those kind of sites because
maybe it didn't get updated in the last
couple of months but maybe it gets
updated in five minutes okay so Google
still wants to check just in case that's
that's my understanding at least yeah I
I think with with regards to the amount
of crawling
and uh the external perception there's
also the aspect of like a lot of sites
have a lot of different pages and then
it's not so much that Google crawls one
page very often it's sometimes just like
well if you have all of these pages and
Google has never crawled them then
Google wouldn't be able to know what to
do with it so some of that perception of
like well if only Google could crawl
more then it would see that I actually
have some good content I I can kind of
understand that is it more about like
crawling more often like my my
assumption is that a lot of people just
look at the crawl stats report in search
console or server logs and just look at
the number of requests over time and
then you don't necessarily see it's like
oh it's looking at my homepage every day
but more like it's looking at 500 pages
every day but which ones are they hoping
to see like that just increasing over
time like what's the ideal state from
from from a site owner's perspective I
think so because that also seems like
maybe bad um you know that form that we
link to in on Onie on
developers.google.com/search um where you can
report issues with the with Googlebot yeah
um and those reports end up
uh in our inboxes and there we see
sometimes that people are like uh
increase our crawl over
time um and it doesn't work like we are
not going to increase anyone's crawling
if they write in through that form like
if there's some crawling emergency then
we would decrease their um uh or the
crawl volume for that site but it's kind
of obvious that they want increased
crawling over time some some people
people want ah okay so you're saying
that like the form is there and you're
supposed to use it only to report like
too much uh like your servers are being
overloaded but people are filling it
out anyway and they're like give me more
yeah but it's a form like we we are
quite explicit about what you should use
that form for but then it's a form so
it's like people are going to people
anyway so um we get other requests as
well which we cannot satisfy but we
still get them how would that work or
have we ever considered a method like
that where people can ask
automatically yeah we we had the setting
in search console but that was about
limiting right so reducing Li of crawl
but it's always about limiting like
because the the upper part that has to
be determined about what we what the
server tells us about how uh much it can
handle what if it says I can handle
everything
well it would not be able to like the we
would at one point we would crush the
server and we wouldn't be able to
connect to it so that would be a very
clear signal that we have to slow down
okay so is it more of a site
owners not uh understanding that Dynamic
when like what it means to request more
that that effect will then be that their
servers
crash I think the confusing part is that
there are two parts to this one is what
the server can handle and then there's
the quality aspect to it the content of
on the
site uh has to be uh of high quality and
useful for for users or helpful for
users um and then search would or the
search demand for crawling would
increase um and then we would crawl more
potentially um and then the technical
part comes into play like how much can
we actually crawl without harming the
server okay but it's not infinite like
there has to be a limit because the
server doesn't have infinite resources
right uh but this year you thought we
can optimize there that there's like
something that we can do I mean we were
thinking about this for a long time like
there were always crawl optimizations
going around um and if you look at the
early posts on um blog posts on on
onesie on on the blog M um then even the
early days 2006 2007 they were
already um like Vanessa Fox former uh
product manager for the old Webmaster
Tools and the team were already thinking
about how to optimize crawling
more is it usually the same uh sort of
approach like we want to be more
efficient about what we're doing or is
it like a timing thing is there
something new that we could be doing
that we haven't thought of before it's a
combination I
guess like sitemaps I don't know John
was involved with sitemaps early on um
but sitemaps was one of those
optimizations um and on our side I don't
know like 304 and If-Modified-Since okay
um that that was something that had to
be implemented on our side support for
it I
mean
cool um and with If-Modified-Since is
that something that you see people are
doing correctly or is is that something
others should be doing
differently wait If-Modified-Since that's
a request header so it it's us doing it
correctly or well it it could be it
could be that the site says it's like oh
yes everything changed today oh I see
it's like we asked has it has it changed
since yesterday and they said yes yes it's
like you must take a look I see uh
because it could be something that's
automatically in place like yes I update
a link but then my CMS says okay today
is the new date that I published content
and so therefore it gets interpreted
that I made a change therefore come look
at it so I think so the response to an
If-Modified-Since would be a 304 right I
think a 304 is not
modified I don't know off hand I would
have to ask my friend Gemini 304 Not
Modified HTTP server response code okay
so 304 would be it's like no Google it's
like nothing has changed here and a 200
I I think would be the response then if
it's like okay here is actually the new
version right um I I think there's also
like caching directives that you can
respond with um there is I I don't
remember the name of the Apache module
Apache server module but there are other
caching directives as well that you can
respond with I think on our side it's
implemented externally doesn't seem to
be used enough I think so basically
people are just responding with uh like
even if we send out the uh
If-Modified-Since uh request header uh servers are
responding with just 200 basically just
ignoring it and I don't think that's
necessarily a good thing but then at
least at Google there are a few products
that probably prefer that
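
The exchange described above, where the crawler sends If-Modified-Since and the server answers either 304 Not Modified (headers only) or 200 with the full page, can be sketched roughly like this. It is a minimal illustration in Python, not how Googlebot or any particular web server implements it; the function name conditional_response and its parameters are hypothetical, and the sketch assumes the site tracks a trustworthy last-modified time for each page.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since, body):
    """Answer a GET that may carry an If-Modified-Since header.

    last_modified: timezone-aware UTC datetime of the page's last real change.
    if_modified_since: raw header value from the request, or None.
    Returns a (status, headers, body) tuple.
    """
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: ignore the header, send the page
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers, b""  # headers only, no response body
    headers["Content-Length"] = str(len(body))
    return 200, headers, body
```

A server that always takes the 200 branch, or a CMS that bumps the last-modified time on every trivial edit, gives up the bandwidth and page-assembly savings the conditional request is meant to provide.
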
MH probably I how so like for example
news I I would imagine that they don't
want especially for live news like live
blog stuff like really time sensitive
things that are happening like as
cricket matches happening or something
yeah we we don't want to cache
those I guess I I don't know but this is
exactly what I I I want to uh to analyze
that like how how much 304 is used by
external sites how many If-Modified-Since
headers are we sending out with with our
fetches um and then try to encourage
people to use it more because it can
save quite a bit of bandwidth and by
definition also resources for the
servers like on our side we don't
particularly care about the resources
for crawling how does it save resources is
it because we can just do a little quick
check and then we don't have to fully
look at everything ex yeah exactly so uh
304 response that or I I if I remember
correctly the the RFC the standard the
standard says that you don't put don't
put the HTTP response body in it like
there should not be a response body it's
just a headers so basically you send
back what like a, bytes instead of like
a thousand 100,00 bytes or whatever it
is it's a lot smaller back and therefore
not taking up as much space from our
side yeah and I guess the server doesn't
need to compile the full page yeah like
the server can just do the lookup in a
database and like oh nothing new like
move along without having to actually
compile the whole thing so it makes it
more efficient I I imagine for both
sides because like if like if you're are
thinking about our CMS that we are using
for onesie there are lots of moving
parts on on onesie like for for example
if you go to the I don't know the the
blog homepage then you have the to on
the left or whatever we call it but the
book on the on the left you have the
title you have the metadata that we have
in the HTML uh we have the metadata from
DevSite the CMS that uh that we use and
then you have the content and then for
all of those you have to make these
weird calls to pull in and to compile
the page and then all those calls um they
cost resources uh but then if you can
just make the that one call that John
said that just check whether anything
changed just one call just one call and
it doesn't matter if it's uh like that's
part step number two uh to figure out
whether or not something actually did
change like we're just checking anyway
doesn't matter uh if the change is big
or not I assume like in the next step it
would be to see like okay what well what
changed well that's I I think on on the
server side the server basically just
says like something changed here's
everything it's not like here's a part
of the page that has changed is that
something like a theoretical uh space
that we could look at like if if we
could say like hey actually it was just
this one paragraph that's where I made
the change you don't need to look at
everything just this one thing was the
change would that be helpful if that
were able to be like compartmentalized
somehow I like from my point of view
probably but implementing it sounds like
a nightmare
I don't know maybe Gary wants to do it
anyway what I mean is this something
that you would be thinking about or is
this like nope crazy no it it's not I
mean it's crazy but it's the the kind of
crazy that we actually like what good
okay um so it's a it's a challenging
task um that can save lots of resources
for the internet not on our side because
again like I wouldn't say that we have
infinite resources but especially with
crawling it's like it's a tiny tiny tiny
fraction of our resource uses you I ran
out of
air crawling is a tiny fraction of our
resource usage and but from like an
external perspective where they have to
render the pages yeah um and make all
those calls to make one
page just sending back the part that
actually
changed that like sounds like a cool
thing yeah and especially with
um uh even in older HTTP versions like
um I think starting from
1.1 um there was a chunked um transfer
so basically you could just say that
from this uh segment to this segment
this is the part and then you could just
give that to the to the client from the
from the server but it was more
complicated and I think it was slightly
broken uh like every now and then the
chunks would get get messed up but then
um someone pointed out on LinkedIn that
the IETF is working uh or someone on the
uh on the IETF track Internet Engineering
Task Force um which is a standards body
where like the Robots Exclusion Protocol
also lives someone submitted a proposal
for a new kind of chunked um transfer mhm
um and I'm watching that closely to see
where it's going how are they currently
thinking about it is it like a i
navigation up here and then the middle
of the page is here or is it something
more like this stuff changes really
that's why that's my naive thinking I I
think it's more complex than that and I
would need to check the the current
draft to to tell you like how how it
actually works um but uh my naive
thinking that was that like here's the
header here's the
sidebar I'm fairly certain it's not that
simple I I imagine that's tricky because
you almost have to render the page to
understand the DOM if you're saying like
oh the header changed yeah whereas from
from a technical point of view if you
can say oh bytes 500 to 700 are now this
thing then that's easier but it's but
people don't reliably put it in that
same
spot we it's free like it's more
interesting because
and more reliable most likely because
it's not up to the person it's down to
the server and of course you you can
hack around with a server and make it
like like both John and I did stupid
things with our servers to to to fool
people interesting apparently John
didn't okay never I take it back never
um like you can do you can make the
server do stupid things but you need
quite a bit of knowledge about like like
in my case I was on Apache about
um server modules like Apache modules and
especially C to be able to
modify uh modules enough to make them do
something stupid I I think it's also
challenging because it mixes the content
with the infrastructure yeah it's almost
like different levels of interaction but
I I think it would be cool if if people
could say it's like oh actually only
this news item changed yeah or like on a
product page like my pricing this little
area is like the thing that is changing
all the time but the description of this
pair of shoes is the same exactly yeah I
I don't know from personal point of view
I I think that would be cool you know
and the the chunked encoding or the
chunk transfer I I think is is pretty
common like it's also done for videos I
think for large files where you have to
for large for large files for sure yeah
also I I think posts like a post methods
yeah I don't know that that sounds
pretty
cool um what what other kinds of
optimizations do you do you see
happening with regards to crawling
maybe better URL parameter
handling what oh okay like hashtags oh
hashtags hashtags hashtags are
complicated and we have a very
complicated relationship with them I
think do you mean hashtags or like what
is it anchors like the the pound oh
sorry the pound symbol the hash symbol
yeah I just assumed that you meant that
sorry I I did mean that so the problem
with them is that they only live on the
client
side okay and why is it a problem Oh
this is because you hate JavaScript
right what I mean yeah but what they're
they're used for JavaScript so for the
the the whole client side server side
like why is it a problem that it's on
the client side it's harder for us to
get there uh pretty much okay it's
further away from us
well Tech technically Google bot cannot
get get
there without rendering without
rendering I see okay and the the URL
parameters that you mentioned that would
be something like the URL parameter
handling tool that we used to have more
in a protocol format where you say this
parameter is optional
or oh that's a good idea can you give me
like a real example of sure like what
what do we mean by
URL params params like the hl= and
whatever parameters that we have on on
Zend on support.google.com okay but like
what would make it hard I guess the fact
that we're using those because
technically you can add the in well
almost infinite well de facto infinite
number of parameters to any URL and the
server will just ignore those that don't
alter the response basically it will
just discard them but that also means
that for every single URL that there's
on that's on the internet you have an
infinite number of
versions because all this stuff can
because you can just add your parameters
to it okay and the server
is instructed to ignore them like it
would not alter the content that it
returns but it also means that when you
are crawling and crawling in the proper
sense in like following links and I'm
air quoting here then
everything um yep I'm why are you
laughing like we are not following links
properly it's just like we are
collecting links and then we are going
back well you imply that there's an
improper use of crawling or an improper
way to crawl well yeah it's my pet peeve
it's like on on Onie we keep saying
Googlebot is following links it's like
no it's not following links it's
collecting links and then it goes back
to those links it's not like properly
following links like the the picture
that we are painting is that Googlebot
is like hopping from it's because it's
going into the anthropomorphic territory
where Googlebot thinks Googlebot sees
Googlebot understands understands
follows walking around on all eight
legs wait six legs how many like okay
don't judge what do you
mean there's got to be a correct answer
for this uh for spiders no spiders they
have an even amount of legs uh URL
parameters why is this a problem in
terms of crawling efficiently so it
sounds like it's because we don't we're
maybe wasting time looking at parameter
versions of the links when it could be
the same thing but sometimes it is
different sometimes it is different and
that's the problem yeah we don't know
based off of the URL like we basically
have to crawl first to know that
something is different and we have to
have a large sample of URLs to make the
decision that oh this these parameters
are uh are useless okay and there's no
way for external like uh site owners to
tell us how they're grouped now do do do
you know how we like to remove features
from search console yes I remember that
we took it away because it was not used
I think I mean it it was not used yes
and now it seems like we there's a need
to to be able to control this but they
weren't using the tool so maybe there
needs to be some other kind of solution
that would be right but like if someone
is complaining that we are over crawling
them because they have one of these
weird URL spaces with yeah an infinite
number of uh URL parameters then we
could just tell them that okay use this
method to
to
block that that URL
space what kind of method like even
robots.txt could be used like it doesn't
have to be that is after this symbol
like don't look at it or this
combination or something like that
interesting because with robots.txt you can
it's surprisingly flexible like what
what you can do with it and that's
something that we could do now or would
it require we just have to figure out
what to say oh interesting and I don't
have brains to think about it okay oh so
the solution to crawling is more
documentation oh job
security
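
As a rough illustration of how flexible robots.txt path rules can be for fencing off a parameter-driven URL space, here is a small Python sketch of the pattern matching defined in the Robots Exclusion Protocol (RFC 9309): "*" matches any run of characters and a trailing "$" anchors the end of the URL. The rules and parameter names below (sessionid, /calendar/) are hypothetical examples, not recommendations from the episode, and the sketch deliberately ignores Allow rules, user-agent groups, and longest-match precedence.

```python
import re


def rule_to_regex(path_pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches zero or more characters; a trailing '$' anchors the end.
    """
    anchored = path_pattern.endswith("$")
    if anchored:
        path_pattern = path_pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' wildcards.
    regex = ".*".join(re.escape(part) for part in path_pattern.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_disallowed(url_path, disallow_patterns):
    """True if any Disallow pattern matches from the start of the path."""
    return any(rule_to_regex(p).match(url_path) for p in disallow_patterns)


# Hypothetical Disallow values a site owner might put in robots.txt.
EXAMPLE_RULES = [
    "/*?*sessionid=",  # any URL whose query string carries sessionid=
    "/calendar/*",     # an endless generated calendar section
]

for path in ["/shoes?sessionid=abc", "/shoes?color=red", "/calendar/2099/01/01"]:
    print(path, "blocked" if is_disallowed(path, EXAMPLE_RULES) else "crawlable")
```

The hard part, as noted in the conversation, is not the syntax but knowing which parameters never change the response, which only the site owner can say with confidence.
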
darn so wait wait
wait we haven't asked John enough
questions about what his ideas are yeah
John what what are your ideas you keep
asking Gary but have you had any
harebrained ideas harebrained
ideas it's top of mind for me top of
mind so
sorry what's top of mind for you
um I I think I think it's challenging
because I like I like sitemaps for
example and apparently people also like
sitemaps and they submit them in lots of
really weird and broken ways so that
makes me a little bit jaded almost in
the sense that it's like we will come up
with a new method to make crawling more
optimal for you and then everyone's like
huh well I will just use it
incorrectly yeah so that's that's kind
of the challenge and on on the other
hand I
also would like to make it so that
Google or other search engines don't
have to guess like how to crawl
optimally and uh it should be more clear
and easy for other search engines to
follow like why do we need to go
reinvent the wheel
maybe maybe I don't know but I I think
also just the the awareness of
everything around crawling I think that
makes a big difference uh I noticed that
uh for example when when I launched my
my first crawler back in the year
1822 it ran on this obscure operating
system called
windows and uh when when I initially
launched that I noticed that it's like
almost every site that you put in there
to try to crawl it's like it it goes
crazy like finds all of this crazy stuff
and it essentially shows how how
complicated the web is like all of these
weird links and they go in all different
places and some of them are broken some
of them are infinitely long yeah and I I
think just generally the awareness of
how crawling Works has gotten a lot
better over that time uh people use
common content management system like
WordPress now which make crawling a lot
easier and maybe some of that awareness
just has to go a little bit further to
make it so that more people understand
um potential pitfalls and then think
about like oh this parameter that I want
to add for tracking maybe I shouldn't or
maybe I should do it in a different way
so that it doesn't affect crawling like
what could be the consequence of my
actions of implementing this thing could
cause domino effect somewhere else yeah
I I think for smaller sites like you can
do a lot of things wrong and oh you have
a thousand URLs instead of 10 it's like
that doesn't change anything uh but if
you're a giant e-commerce site and
suddenly you have a 100 billion URLs
instead of 1 million then that's kind of
a big
difference uh so some some amount of
awareness from both sides I I think is
important also the thing about
okay but I have enough resources so just
go ahead and crawl them anyway yes CU I
feel but then but then it's like we
could spend that time on URLs that will
actually help your
site because sure I I I don't like when
people think about crawl budget but we
are still spending time on
crawling and you could apply it in a
productive way like why yeah is it's not
just exponential we just everything
firehose and you will catch also the garbage
stuff that doesn't matter it's not
helping anyone
yeah so if you had to say one thing that
you wish people wouldn't do or would
your your pet peeve what would it be
John your pet peeve my my pet peeve
is at the moment and I I guess like at
at the moment means I I recently
received some emails from folks about this
is people who don't look at the the
server stats in search console server
stats at the crawl stats crawl the crawl
stats in search console because there's
a lot of information in there if you
just look at it for example response
time is in there average response time
and like are they just coming to your
inbox and saying John what is my average
response time like hello you can just go
look it up or what kind of question
answer is like 792 milliseconds no no well
the the problem is the problem for me is
when it's not milliseconds anymore like
oh why are you not crawling my site
enough and I look at the stats and it's
like oh it takes on average like three
seconds to get a page from your server
it's like that's actually a very long
time we we don't really tell people like
what they should be aiming for there see
it's either is it an on and off thing
like it's either working or it's not and
if it takes 2 seconds versus 10 seconds
that's still not necessarily we're not
showing it as broken well I mean like
several seconds is actually fairly long
like if if you want us to crawl a
million URLs from your website and
instead of 100 milliseconds it takes
like 10 times as much or 20 times as
much that's that's a big difference and
that's something where if you looked at
those stats then you could go to whoever
whoever's running your server and be
like look at these numbers these numbers
are objectively bad yeah you can improve
them and then they have something that
they can work on which is very different
from a lot of other SEO things where
it's like oh my relevance is not great
and then someone else on the server side
is like well okay I can't change that
this is more like a clear like an it's a
black and white sort of yeah number that
you can take back and say like things
are bad please fix it exactly and you
can multiply number of pages on your
site by the response time you're like
it's like this is a lot of time that is
being wasted
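
The multiplication suggested here is easy to work through. The sketch below uses the figures mentioned in the conversation (a million URLs, 100 milliseconds versus roughly 20 times as much) and assumes, purely for simplicity, that pages are fetched one after another; real crawlers fetch in parallel, so treat the result as total fetch time spent, not wall-clock duration. The function name is hypothetical.

```python
def total_fetch_seconds(url_count, avg_response_seconds):
    """Total time spent waiting on the server if every URL is fetched once."""
    return url_count * avg_response_seconds


urls = 1_000_000
fast = total_fetch_seconds(urls, 0.1)  # 100 ms average response time
slow = total_fetch_seconds(urls, 2.0)  # 20 times slower: 2 s per page

print(f"{fast / 3600:.1f} hours at 100 ms per page")  # 27.8 hours
print(f"{slow / 86400:.1f} days at 2 s per page")     # 23.1 days
```

Numbers like these, taken from the average response time in the crawl stats report, are the kind of concrete figure that can be handed to whoever runs the server.
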
mhm okay so open the crawl stats so look
at search console
yeah and Gary what do you think Gary you
you you mentioned uh your pet peeve was
people
anthropomorphizing that's your pet peeve
that I do maybe yes uh but for the the
rest of the
people or in general like a a pet peeve
that you have about crawling that you
wish that people either knew or like a
misconception that you see like what the
heck if people would just do this or
stop doing this
hm I don't know if I have a pet peeve
really like there
are or a hill you will die
on
so I kind
of
want hosting companies
to
um help more their customers when things
go wrong because
I wouldn't say very often but every now
and then we see sites complaining to us
that Googlebot is not crawling them and
then we look at what's happening and
it's like uh uh their their DNS server
is blocking us or their server is
blocking us or their network is blocking
us and then we are like like we have no
idea where it's blocking but it's
blocking and it's on your side and they
are like no because the hosting company
was like it must be you like but it
cannot be you like we see that we cannot
connect to your
server like why would we not want to
connect to your server or your DNS um or
whatever and it's like no but the
hosting company was like it's on your
side and I understand that because of
how hosting companies are set up
nowadays that they are behind the CDN
that also eats up some of the uh trace
information um or they are on um uh
elastic clusters that grow and shrink
and um some of the again some of the
traces are lost but still
if we could just spend more time on like
telling
people we as like those who worked on
networking or whatever uh or server
management um how connections are made
and then help
people
understand and also debug their problems
that would be
fantastic um because like if you know
how a connection is made between two
between a client and a server then like
saying that it's on your side the
problem when a client cannot uh or it's
on the client side the problem when a
client cannot connect to a server that's
like a
stretch so so you're saying more search
console what's a search console more
more features in search console that I
was hearing like in like videos when
when you're doing something wrong or so
that tell the site owner the hoster we
should send more
messages but we should send all the
messages on a single day on a single day
yeah pile them up and then on I don't
know first off uh first day of the month
just just send out all the messages that
we I I have a better idea we post the
messages on social
media and then anyone can fix any site's
problem I know and then we tag
we tag people people yeah hey this is
your site this is your site and we tag
all the hosting companies oh to like
hello we can add them directly like the
companies no that's too much I mean
sometimes the crawling problem is also
on our side sure so like we we kind of
have to accept that they will do the
same thing maybe it's the last resort we
were not able to contact you via this
message so yes we are now broadcasting
we oh we did that before we've done that
before we've also sent faxes before
really faxes
yes is this like a setting this would be
great actually a great setting in in
search console so instead
of like email notific like what method
would you like to be notified a fax
option a fax number yes it's
handwritten from John handwritten from
John wait we want people to be able to
read that you have bad handwriting I I
don't think I've ever seen your
handwriting I can't confirm
actually I've never seen you write maybe
it's only speech to text all right I
think we are way over time potentially
my timekeeper didn't gesture anything so
I'm not sure we gestured a little bit a
little bit and I missed it because I
can't see that's fine okay it was fun it
was a good good discussion oh it was
yeah oh it well it was supposed to be
painful this was supposed to be well it
was painful good
to
me okay well that's it for this episode
next time on search off the Record we'll
be talking with Mii another product
expert uh about working with the search
console API thank you folks for
listening and goodbye goodbye
bye-bye we've been having fun with this
podcast and I hope you The Listener have
found it both entertaining and
insightful too feel free to drop us a
note on Twitter at @googlesearchc or
chat with us at one of the next events
we go to if you have any thoughts and of
course don't forget to like and
subscribe thank you and goodbye
[Music]