Crawling smarter, not harder | Search Off the Record
2024-08-08 · en automatic
[Music] Hello and welcome to another episode of Search Off the Record, a podcast coming to you from the Google Search team. My name is John, and today we have Lizzi and Gary. Say hi. Don't tell us what to do. Yeah, hi. Thank you, thank you, so nice to have you here. Last time we talked with Dave Smart, and apparently we also talked about crawling, but I was not here. For the listeners: John is trying to figure out Lizzi's notes, because Lizzi started reading this, or wanted to read this, and then John was like, "No, I'll do it." He would not let me do the intro, so now we are left with this intro, which is very confusing. OK, go forward, Lizzi. OK, so this is supposed to be a part two. For people who were not following along, I guess, we had episode one with Dave Smart to talk about what crawling is, and we sort of did a background, I don't know, set-the-stage episode. Since then, Gary has posted too many times about crawling on LinkedIn, so we thought maybe we could talk about that. What do you mean? A, why was I not told that Dave was part one? Two, what does it mean I'm posting too much, or too many things, about crawling? What is "too"? T-O-O, or two, T-W-O? Your English construction is weird. I heard that you posted about crawling. But I actually didn't. Yes, I heard, you told me that you posted about crawling on LinkedIn and you got some surprising responses from people. Surprising in more senses than one. Are you sure? I'm pretty sure it was you. [Laughter] Oh, I also heard that this year you were going to work on crawling. Oh, is that a true statement? Yeah, at the beginning of the year you thought maybe you would do something with crawling. Well, yeah, and I mean, we've already done some things, I think. But in general, yes, I think we should do more on crawling, in the sense that we should make it more... well, we should crawl somehow less, which would mean that we crawl more. I think you did post about that on LinkedIn, and then Barry cross-posted that Google wants to crawl less, and then the
internet broke, because they were like, "What?" Barry from... this is Barry from Search Engine Roundtable, right? Yes, Barry Schwartz. Oh, cool. I mean, it's something I hear a lot, where they think, "Well, Google usually crawls more when he thinks my site is good." Google, the Googlebot. They/them. Googlebot accepts all pronouns. OK, then that was fine. I'm sorry, are you a spokesperson for Googlebot? Yes. OK, so people thought that Googlebot usually crawls more when Googlebot thinks that something is good. So the assumption is that you can turn it around as well and be like, "Well, I will push Googlebot to crawl more, and then Googlebot will think my site is actually good." Which, no. I mean, is that a chicken-and-egg thing, though? Does your site have to be good first for Google to then crawl it more, or does Google crawling more mean your site is good? I don't know. Gary, what do you think? Why me? If I can make Googlebot crawl my site more because of my fancy robots.txt file, does that mean that my site will be better in search? I mean, why would it? It sounds like people are using this as a proxy: if Google is interested in my site more often, that means that stuff is good. But it could also mean that there's an infinite space on the site, so it's not... Oh, that's a cool hack, I'll put a calendar script on my site. No. Sit down, please. Has this always been a thing, that people think more crawling equals good? I think so. I mean, one of the presentations that we keep doing at Search Central Live events is actually about myth busting, and it has at least one or two questions about crawling. And then it's like, "Oh, Google is crawling my site a lot, so my site must be very good." And like, not really. It can mean many things. But generally, if the content of a site is of high quality and it's helpful and people like it in general, then Googlebot, well, Google, tends to crawl more from that site. But it can also
mean that, I don't know, the site was hacked, and then there's a bunch of new URLs that Googlebot gets excited about, and then it goes out and is crawling like crazy. Or we discover John's calendar script, and then we try to crawl every single URL for every day until 20177. So it can mean other things as well than just quality. But then, on the flip side, if we are not crawling much, or we are gradually slowing down with crawling, that might be a sign of low-quality content, or that we rethought the quality of the site. But what if it's not changing? What if the content... so we go and crawl it and they haven't made a change. Why would we need to go crawl that often again if they're not making a lot of changes? I mean, we have to go back and see if it changed, right? But if we notice that it's not changing, do we then back off? Would that result in less crawling over time? Probably, but I don't know. John has a site that he hasn't updated in like 72 years. I'm looking at the logs here, and he could say it still gets crawled. Yeah, I think it's challenging with those kinds of sites, because maybe it didn't get updated in the last couple of months, but maybe it gets updated in five minutes. OK, so Google still wants to check, just in case. That's my understanding, at least. Yeah, I think with regards to the amount of crawling and the external perception, there's also the aspect that a lot of sites have a lot of different pages, and then it's not so much that Google crawls one page very often. It's sometimes just like, well, if you have all of these pages and Google has never crawled them, then Google wouldn't be able to know what to do with them. So some of that perception of, "Well, if only Google could crawl more, then it would see that I actually have some good content," I can kind of understand that. Is it more about crawling more often? My assumption is that a lot of people just look at the crawl
stats report in Search Console, or server logs, and just look at the number of requests over time. And then you don't necessarily see, "Oh, it's looking at my homepage every day," but more like, "It's looking at 500 pages every day." But which ones? Are they hoping to see that just increasing over time? What's the ideal state from a site owner's perspective? I think so, because that also seems like maybe bad. You know that form that we link to on Onie, on developers.google.com/search, where you can report issues with Googlebot? Yeah. Those reports end up in our inboxes, and there we sometimes see that people are like, "Increase our crawl over time." And it doesn't work. We are not going to increase anyone's crawling if they write in through that form. If there's some crawling emergency, then we would decrease the crawl volume for that site. But it's kind of obvious that they want increased crawling over time. Some people want that. Ah, OK, so you're saying that the form is there and you're supposed to use it only to report too much crawling, like your servers are being overloaded, but people are filling it out anyway and they're like, "Give me more." Yeah. We are quite explicit about what you should use that form for, but then it's a form, so people are going to people anyway. So we get other requests as well, which we cannot satisfy, but we still get them. How would that work? Or have we ever considered a method like that, where people can ask automatically? Yeah, we had the setting in Search Console, but that was about limiting, right? So reducing the limit of crawl. But it's always about limiting, because the upper part has to be determined by what the server tells us about how much it can handle. What if it says, "I can handle everything"? Well, it would not be able to. At one point we would crush the server and we wouldn't be able to
connect to it, so that would be a very clear signal that we have to slow down. OK, so is it more a matter of site owners not understanding that dynamic, what it means to request more, that the effect will then be that their servers crash? I think the confusing part is that there are two parts to this. One is what the server can handle, and then there's the quality aspect to it. The content on the site has to be of high quality and useful for users, or helpful for users, and then the search demand for crawling would increase, and then we would crawl more, potentially. And then the technical part comes into play: how much can we actually crawl without harming the server? OK, but it's not infinite. There has to be a limit, because the server doesn't have infinite resources, right? But this year you thought we can optimize there, that there's something that we can do. I mean, we were thinking about this for a long time. There were always crawl optimizations going around. If you look at the early blog posts on onesie, on the blog, then even in the early days, 2006, 2007, Vanessa Fox, former product manager for the old Webmaster Tools, and the team were already thinking about how to optimize crawling more. Is it usually the same sort of approach, like we want to be more efficient about what we're doing? Or is it a timing thing? Is there something new that we could be doing that we haven't thought of before? It's a combination, I guess. Like sitemaps. I don't know, John was involved with sitemaps early on, but sitemaps was one of those optimizations. And on our side, I don't know, like 304 and If-Modified-Since. OK. That was something that had to be implemented on our side, support for it, I mean. Cool. And with If-Modified-Since, is that something that you see people are doing correctly, or is that something others should be doing differently? Wait, If-Modified-Since, that's a
request header, so it's us doing it correctly or... Well, it could be that the site says, "Oh yes, everything changed today." Oh, I see. It's like we asked, "Has it changed since yesterday?" and the site said, "Yes, yes, you must take a look." I see. Because it could be something that's automatically in place. Like, yes, I update a link, but then my CMS says, "OK, today is the new date that I published content," and so it gets interpreted that I made a change, therefore come look at it. I think so. So the response to an If-Modified-Since would be a 304, right? I think a 304 is Not Modified. I don't know offhand, I would have to ask my friend Gemini. "304 Not Modified, HTTP server response code." OK, so a 304 would be, "No, Google, nothing has changed here." And a 200, I think, would be the response if it's like, "OK, here is actually the new version," right? I think there are also caching directives that you can respond with. I don't remember the name of the Apache server module, but there are other caching directives as well that you can respond with. I think on our side it's implemented; externally it doesn't seem to be used enough. So basically, even if we send out the If-Modified-Since request header, servers are responding with just 200, basically just ignoring it. And I don't think that's necessarily a good thing. But then, at least at Google, there are a few products that probably prefer that. Mm-hmm, probably. How so? For example, News. I would imagine that they don't want that, especially for live news, like live blog stuff, really time-sensitive things that are happening, like a cricket match happening or something. Yeah, we don't want to cache those, I guess. I don't know. But this is exactly what I want to analyze: how much 304 is used by external sites, and how many If-Modified-Since headers are we sending out with our fetches. And then try
to encourage people to use it more, because it can save quite a bit of bandwidth and, by definition, also resources for the servers. On our side, we don't particularly care about the resources for crawling. How does it save resources? Is it because we can just do a little quick check and then we don't have to fully look at everything? Yeah, exactly. So a 304 response... if I remember correctly, the RFC, the standard, says that you don't put the HTTP response body in it. There should not be a response body, it's just headers. So basically you send back, what, a small number of bytes instead of like a thousand, 100,000 bytes, or whatever it is. It's a lot smaller coming back, and therefore not taking up as much space on our side. Yeah, and I guess the server doesn't need to compile the full page. Yeah, the server can just do the lookup in a database and go, "Oh, nothing new, move along," without having to actually compile the whole thing. So it makes it more efficient, I imagine, for both sides. Because if you're thinking about the CMS that we are using for onesie, there are lots of moving parts on onesie. For example, if you go to, I don't know, the blog homepage, then you have the TOC on the left, or whatever we call it, the book, on the left. You have the title, you have the metadata that we have in the HTML, we have the metadata from DevSite, the CMS that we use, and then you have the content. And for all of those you have to make these weird calls to pull them in and to compile the page, and all those calls cost resources. But then, if you can just make that one call that John mentioned, that just checks whether anything changed... Just one call. Just one call. And it doesn't matter if it's... like, that's step number two, to figure out whether or not something actually did change. We're just checking; anyway, it doesn't matter if the change is big or not. I assume in the next step it
would be to see, OK, well, what changed? Well, I think on the server side, the server basically just says, "Something changed, here's everything." It's not like, "Here's the part of the page that has changed." Is that something like a theoretical space that we could look at? If we could say, "Hey, actually it was just this one paragraph, that's where I made the change, you don't need to look at everything, just this one thing was the change," would that be helpful, if that were able to be compartmentalized somehow? From my point of view, probably, but implementing it sounds like a nightmare. I don't know, maybe Gary wants to do it anyway. What? I mean, is this something that you would be thinking about, or is this like, "Nope, crazy"? No, it's not. I mean, it's crazy, but it's the kind of crazy that we actually like. What? Good. OK. So it's a challenging task that can save lots of resources for the internet. Not on our side, because again, I wouldn't say that we have infinite resources, but especially with crawling, it's a tiny, tiny, tiny fraction of our resource uses... I ran out of air. Crawling is a tiny fraction of our resource usage. But from an external perspective, where they have to render the pages... Yeah. ...and make all those calls to make one page, just sending back the part that actually changed sounds like a cool thing. Yeah, and even in older HTTP versions, I think starting from 1.1, there was a chunked transfer, so basically you could just say that from this segment to this segment, this is the part, and then you could just give that to the client from the server. But it was more complicated, and I think it was slightly broken; every now and then the chunks would get messed up. But then someone pointed out on LinkedIn that the IETF is working... or someone on the IETF track, the Internet Engineering Task Force, which is a standards body
where the Robots Exclusion Protocol also lives, someone submitted a proposal for a new kind of chunked transfer. Mm-hmm. And I'm watching that closely to see where it's going. How are they currently thinking about it? Is it like, the navigation is up here and then the middle of the page is here, or is it something more like, "This stuff changes"? That's my naive thinking. I think it's more complex than that, and I would need to check the current draft to tell you how it actually works. But my naive thinking was that it's like, here's the header, here's the sidebar. I'm fairly certain it's not that simple. I imagine that's tricky, because you almost have to render the page to understand the DOM if you're saying, "Oh, the header changed." Yeah. Whereas from a technical point of view, if you can say, "Oh, bytes 500 to 700 are now this thing," then that's easier. But people don't reliably put it in that same spot. It's more interesting, and most likely more reliable, because it's not up to the person, it's down to the server. And of course you can hack around with a server. Both John and I did stupid things with our servers to fool people. Interesting. Apparently John didn't. OK, never, I take it back. Never. You can make the server do stupid things, but you need quite a bit of knowledge. In my case, I was on Apache, so knowledge about server modules, like Apache modules, and especially C, to be able to modify modules enough to make them do something stupid. I think it's also challenging because it mixes the content with the infrastructure. Yeah. It's almost like different levels of interaction. But I think it would be cool if people could say, "Oh, actually only this news item changed." Yeah. Or on a product page, my pricing, this little area, is the thing that is changing all the time, but the description of this pair of shoes is the same. Exactly. Yeah, I
don't know, from a personal point of view, I think that would be cool. And the chunked encoding, or the chunked transfer, I think is pretty common. It's also done for videos, I think, for large files. For large files, for sure, yeah. Also, I think, for POST, like POST methods. Yeah, I don't know, that sounds pretty cool. What other kinds of optimizations do you see happening with regards to crawling? Maybe better URL parameter handling. What? Oh, OK. Like hashtags? Oh, hashtags. Hashtags are complicated, and we have a very complicated relationship with them, I think. Do you mean hashtags, or, what is it, anchors? Like the pound... oh, sorry, the pound symbol, the hash symbol. Yeah, I just assumed that you meant that. Sorry, I did mean that. So the problem with them is that they only live on the client side. OK, and why is that a problem? Oh, this is because you hate JavaScript, right? What? I mean, yeah, but they're used for JavaScript. So the whole client side, server side: why is it a problem that it's on the client side? It's harder for us to get there? Pretty much. OK, it's further away from us. Well, technically, Googlebot cannot get there without rendering. Without rendering, I see. OK, and the URL parameters that you mentioned, that would be something like the URL parameter handling tool that we used to have, more in a protocol format, where you say this parameter is optional or... Oh, that's a good idea. Can you give me a real example? Sure. What do we mean by URL params? Like the hl= and whatever parameters that we have on support.google.com. OK, but what would make it hard? I guess the fact that we're using those? Because technically you can add an almost infinite, well, de facto infinite, number of parameters to any URL, and the server will just ignore those that don't alter the response. Basically, it will just discard them. But that also
means that for every single URL that's on the internet, you have an infinite number of versions, because you can just add URL parameters to it. OK. And the server is instructed to ignore them, so it would not alter the content that it returns. But it also means that when you are crawling, and crawling in the proper sense, in like "following links," and I'm air quoting here, then everything... Yep. Why are you laughing? We are not following links properly. It's just like, we are collecting links and then we are going back. Well, you imply that there's an improper use of crawling, or an improper way to crawl. Well, yeah, it's my pet peeve. On Onie we keep saying Googlebot is "following links." It's like, no, it's not following links, it's collecting links and then it goes back to those URLs. It's not properly following links. The picture that we are painting is that Googlebot is hopping from... It's because it's going into the anthropomorphic territory, where Googlebot thinks, Googlebot sees, Googlebot understands, follows. Walking around on all eight legs. Wait, six legs? How many? OK, don't judge. What do you mean? There's got to be a correct answer for this. For spiders? No... Spiders, they have an even number of legs. URL parameters: why is this a problem in terms of crawling efficiently? It sounds like it's because we're maybe wasting time looking at parameter versions of the links when it could be the same thing. But sometimes it is different. Sometimes it is different, and that's the problem. Yeah, we don't know based off of the URL. We basically have to crawl first to know that something is different, and we have to have a large sample of URLs to make the decision that, oh, these parameters are useless. OK, and there's no way for external site owners to tell us how they're grouped now? Do you know how we like to remove features from Search Console? Yes, I remember that we took it
away because it was not used, I think. I mean, it was not used, yes. And now it seems like there's a need to be able to control this, but they weren't using the tool, so maybe there needs to be some other kind of solution. Right. But if someone is complaining that we are over-crawling them because they have one of these weird URL spaces with, yeah, an infinite number of URL parameters, then we could just tell them, "OK, use this method to block that URL space." What kind of method? Even robots.txt could be used. It doesn't have to be... like, whatever is after this symbol, don't look at it, or this combination, or something like that. Interesting. Because with robots.txt, it's surprisingly flexible what you can do with it. And is that something that we could do now, or would it require... We just have to figure out what to say. Oh, interesting. And I don't have the brains to think about it. OK. Oh, so the solution to crawling is more documentation. Oh, job security. Darn. So wait, wait, wait, we haven't asked John enough questions about what his ideas are. Yeah, John, what are your ideas? You keep asking Gary, but have you had any harebrained ideas? Harebrained ideas? It's top of mind for me. Top of mind? So sorry, what's top of mind for you? I think it's challenging, because I like sitemaps, for example, and apparently people also like sitemaps, and they submit them in lots of really weird and broken ways. So that makes me a little bit jaded, almost, in the sense that it's like, we will come up with a new method to make crawling more optimal for you, and then everyone's like, "Huh, well, I will just use it incorrectly." Yeah. So that's kind of the challenge. On the other hand, I also would like to make it so that Google or other search engines don't have to guess how to crawl optimally, and it should be more clear and easy for other search engines to follow. Like, why do we need to go reinvent the wheel? Maybe, maybe. I don't know. But I
think also just the awareness of everything around crawling makes a big difference. I noticed that, for example, when I launched my first crawler back in the year 1822. It ran on this obscure operating system called Windows. When I initially launched that, I noticed that almost every site that you put in there to try to crawl, it goes crazy, it finds all of this crazy stuff, and it essentially shows how complicated the web is. All of these weird links, and they go in all different places, and some of them are broken, some of them are infinitely long. Yeah. And I think just generally the awareness of how crawling works has gotten a lot better over that time. People use common content management systems like WordPress now, which make crawling a lot easier. And maybe some of that awareness just has to go a little bit further, to make it so that more people understand potential pitfalls and then think about, "Oh, this parameter that I want to add for tracking, maybe I shouldn't, or maybe I should do it in a different way so that it doesn't affect crawling." Like, what could be the consequence of my actions? Implementing this thing could cause a domino effect somewhere else. Yeah, I think for smaller sites you can do a lot of things wrong, and, oh, you have a thousand URLs instead of 10, that doesn't change anything. But if you're a giant e-commerce site and suddenly you have 100 billion URLs instead of 1 million, then that's kind of a big difference. So some amount of awareness from both sides, I think, is important. Also the thing about, "OK, but I have enough resources, so just go ahead and crawl them anyway." Yes, 'cause I feel... but then it's like, we could spend that time on URLs that will actually help your site. Sure. I don't like when people think about crawl budget, but we are still spending time on crawling, and you could apply it in a productive way. Like, why... Yeah. It's
not just exponential. We just... everything, fire hose, and you will also catch the garbage stuff that doesn't matter. It's not helping anyone. Yeah. So if you had to say one thing that you wish people wouldn't do, or would, your pet peeve, what would it be, John? Your pet peeve. My pet peeve at the moment, and I guess "at the moment" means I recently received some messages from folks about this, is people who don't look at the server stats in Search Console. Server stats? The crawl stats. The crawl stats in Search Console, because there's a lot of information in there if you just look at it. For example, response time is in there, average response time. And are they just coming to your inbox and saying, "John, what is my average response time?" Like, hello, you can just go look it up. Or what kind of question? The answer is like, "792 milliseconds." No, no. Well, the problem for me is when it's not milliseconds anymore. Like, "Oh, why are you not crawling my site enough?" and I look at the stats and it's like, oh, it takes on average three seconds to get a page from your server. That's actually a very long time. We don't really tell people what they should be aiming for there. Is it an on-and-off thing, like it's either working or it's not? And if it takes 2 seconds versus 10 seconds, we're still not showing it as broken. Well, I mean, several seconds is actually fairly long. If you want us to crawl a million URLs from your website, and instead of 100 milliseconds it takes 10 times as much or 20 times as much, that's a big difference. And that's something where, if you looked at those stats, then you could go to whoever's running your server and be like, "Look at these numbers. These numbers are objectively bad." Yeah. "You can improve them." And then they have something that they can work on, which is very different from a lot of other SEO things, where it's like, "Oh, my relevance is not great," and then
someone else on the server side is like, "Well, OK, I can't change that." This is more like a clear, black-and-white sort of... Yeah. ...number that you can take back and say, "Things are bad, please fix it." Exactly. And you can multiply the number of pages on your site by the response time, and it's like, this is a lot of time that is being wasted. Mm-hmm. OK, so open the crawl stats. So look at Search Console. Yeah. And Gary, what do you think? Gary, you mentioned your pet peeve was people anthropomorphizing. That's your pet peeve, that I do. Maybe, yes. But for the rest of the people, or in general, a pet peeve that you have about crawling, that you wish that people either knew, or a misconception that you see, like, "What the heck, if people would just do this or stop doing this"? Hm, I don't know if I have a pet peeve, really. Or a hill you will die on? So I kind of want hosting companies to help their customers more when things go wrong. Because, I wouldn't say very often, but every now and then we see sites complaining to us that Googlebot is not crawling them, and then we look at what's happening and it's like, their DNS server is blocking us, or their server is blocking us, or their network is blocking us. And then we are like, "We have no idea where it's blocking, but it's blocking, and it's on your side." And they are like, "No, because the hosting company was like, 'It must be you.'" But it cannot be. We see that we cannot connect to your server. Why would we not want to connect to your server, or your DNS, or whatever? And it's like, "No, but the hosting company was like, 'It's on your side.'" And I understand that, because of how hosting companies are set up nowadays, they are behind a CDN that also eats up some of the trace information, or they are on elastic clusters that grow and shrink, and again, some of the traces are lost. But still, if we could just spend more time on telling people,
we, as in those who worked on networking or server management or whatever, how connections are made, and then help people understand and also debug their problems, that would be fantastic. Because if you know how a connection is made between a client and a server, then saying that the problem is on the client side when a client cannot connect to a server, that's a stretch. So you're saying more Search Console. What's a Search Console? More features in Search Console, that's what I was hearing. Like videos for when you're doing something wrong, or something that tells the site owner or the hoster. We should send more messages, but we should send all the messages on a single day. On a single day? Yeah, pile them up, and then on, I don't know, the first day of the month, just send out all the messages. I have a better idea: we post the messages on social media, and then anyone can fix any site's problem. I know, and then we tag people. Yeah, "Hey, this is your site, this is your site." And we tag all the hosting companies too, like, hello. We can @ them directly, the companies. No, that's too much. I mean, sometimes the crawling problem is also on our side. Sure. So we kind of have to accept that they will do the same thing. Maybe it's the last resort: "We were not able to contact you via this message, so yes, we are now broadcasting." Oh, we did that before. We've done that before. We've also sent faxes before. Really? Faxes? Yes. Is this like a setting? This would actually be a great setting in Search Console. So instead of email notifications, like, "By what method would you like to be notified?" A fax option, a fax number. Yes, it's handwritten from John. Handwritten from John. Wait, we want people to be able to read that. You have bad handwriting? I don't think I've ever seen your handwriting. I can't confirm. Actually, I've never seen you write. Maybe it's only
speech-to-text. All right, I think we are way over time, potentially. My timekeeper didn't gesture anything, so I'm not sure. We gestured a little bit. A little bit, and I missed it because I can't see. That's fine. OK, it was fun, it was a good discussion. Oh, it was? Yeah. Well, it was supposed to be painful. Well, it was painful. Good to me. OK, well, that's it for this episode. Next time on Search Off the Record, we'll be talking with Mii, another product expert, about working with the Search Console API. Thank you, folks, for listening, and goodbye. Goodbye. Bye-bye. We've been having fun with this podcast, and I hope you, the listener, have found it both entertaining and insightful too. Feel free to drop us a note on Twitter @googlesearchc, or chat with us at one of the next events we go to, if you have any thoughts. And of course, don't forget to like and subscribe. Thank you and goodbye. [Music]
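
Editor's illustration (not part of the recording): the If-Modified-Since and 304 exchange discussed in this episode can be sketched roughly as follows. This is a minimal sketch in Python, assuming a hypothetical helper named conditional_response and a caller-supplied render_body function; neither is a Googlebot or web-server API, and a real server would likely handle more cases (for example ETag-based validation). It compares the date in the request header with the page's last-modified time and, when nothing is newer, returns a 304 with headers only, so the page is never compiled.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(request_headers, last_modified, render_body):
    """Return (status, headers, body) for a GET that may carry If-Modified-Since.

    Hypothetical helper for illustration only. `last_modified` is assumed to be
    a timezone-aware datetime; `render_body` is a function that builds the page.
    """
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    raw = request_headers.get("If-Modified-Since")
    if raw:
        try:
            since = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            since = None  # Unparseable date: ignore the header, send the full page.
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)  # Treat a naive date as UTC.
        if since is not None and last_modified <= since:
            # Nothing changed: headers only, no body, and the page is never rendered.
            return 304, headers, b""

    # Changed, or no usable validator in the request: render and send everything.
    return 200, headers, render_body()
```

With a stored last-modified time of 7 August 2024, 12:00 UTC, a request carrying If-Modified-Since for that same time would take the 304 branch and render_body would not be called; an earlier date, a missing header, or an unparseable date would get the full 200 response.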