Crawling Challenges: What the 2025 Year-End Report Tells Us.
2026-02-03 · en automatic
[music]
>> Hello, bonjour. Welcome, grüezi, to another episode of Search Off the Record, our podcast. My name is Martin Splitt. I am from the Search Relations team, and with me today is Gary Illyes, also from the Search Relations team. Hi, Gary.
>> Hello.
>> Did I pronounce your last name right?
>> No.
>> Ah, I tried so hard and I still cannot do it properly.
>> Yeah. It's Hungarian, so don't blame yourself. It's a horrible, horrible language.
>> No, it's a wonderful language. Have you tried German?
>> Oh, yeah.
>> Yeah. [laughter]
>> Yeah. I have so much trouble with High German and with der, die, das. That's my pet peeve. I think I have a decent vocabulary, but with der, die, das, and also if you have to make it accusative and whatever,
>> then it becomes even more complicated.
>> I just can't. And I'm so grateful that the Swiss realized this, the Swiss Germans or the German-speaking Swiss realized this, and they just got rid of it.
>> Yeah.
>> Because in Swiss German you don't have der, you have "d".
>> Yeah.
>> And I love it. But if I go to Berlin, for example, or Frankfurt, and I have to say something and I say it in Swiss German, then they are just blinking at me,
>> and then I have to say whatever I said in High German, and then they correct me.
>> The Swiss do use the articles, but they use them differently. In High German it's [unclear] and here it's das, and I think that's also confusing. Anyway, it doesn't matter, but when you are speaking, you are not pronouncing it fully.
>> That's true. In Swiss German at least, you're not. Yeah, that's true.
>> Yeah, it's just "d" or something like that.
>> You can also use "s" for everything.
>> Okay, fair enough. Perfect. Perfect.
>> Great, isn't it?
>> Yeah.
>> So, what do you want to talk about?
>> I want to talk about the things that aren't perfect, because I know that you have had a look at crawling throughout the year, and I'm just curious: what are the things that you found? Are we doing well? Are we doing not so well? Give me the 2025 Wrapped in crawling.
>> Well, did you read my report that I sent to the team?
>> I did a cursory reading, yes.
>> Okay, that's what I was expecting from you. I can't expect any more.
>> But there was one category that stuck out to me.
>> So, to give you, the listener, some background: our team handles that "report a crawl issue" form. Basically, when someone submits that form and the form input is validated, it ends up in our inbox. Once it's in our inbox, depending on what's in the form, we take some sort of action. But the first thing we need to do is validate whether there is an issue or not. When you are validating the issue, you can categorize it into several categories. One of the categories is that there is no issue, but then there are one, two, three, four, five buckets that we can put the issue into, and the team that is basically ensuring crawl quality does different things based on how we categorize the issue.
>> Mhm.
>> So the buckets that we have, or rather how I named them internally (I'm reading the report right now, so you're getting the actual download): the first is faceted navigation. The second issue category is action parameters. The third one is irrelevant parameters. Then we have calendar parameters, or otherwise event dates. And finally, we have basically an "other" category where we put stuff that doesn't fit anywhere else. This is the smallest one, because the vast majority of the reports can be categorized into these buckets.
>> Mhm.
>> Or rather, into the previously mentioned buckets.
>> So what did we find? I really like the report.
I just think there are things that we should probably make available to the larger audience.
>> Like what? Not my coffee.
>> Not your coffee. But the things that we saw and that we found. I know that some of these buckets are substantially larger than others.
>> Yeah.
>> And they are implementation dependent, right?
>> Yes.
>> So,
>> so a large chunk of the issues that we looked at is related to faceted navigation.
>> That's fascinating, because I keep seeing this discussed on Reddit and on social media and at conferences and stuff, and I don't think it got that much attention. Seeing that this is such a large percentage of the things that we looked at
>> Yeah.
>> is interesting.
>> It's close to 50% of the total reports that we got.
>> Mhm.
>> Which says a lot, I think.
>> Should we explain what it is?
>> You do it.
>> So, if you have a website that allows filtering and sorting through various dimensions or options. For instance, you have an online shop, and you allow me... Digitec is great. For instance, I needed a multi-socket adapter where I can plug multiple things into one power socket, and I wanted them to be individually switched, so I can switch them on or off individually. It had an option to filter for multi-socket adapters with individual switches. These kinds of things tend to end up giving you a large number of combinations if you have a bunch of them. You can filter by price, by category, by manufacturer, by whatever details the product might have. And that creates a URL that shows the products you have in store that fit this combination. But because they are combinations, you can end up with lots and lots of URLs with different variations of the individual settings. Right? Is that roughly summing it up?
>> Right. And for the listener, Digitec or Galaxus is basically the Swiss version of Amazon.
>> True. Sorry for that. Yes.
>> There is no Amazon in Switzerland, but yeah, that's a good summary.
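To make the "lots and lots of URLs" point concrete, here is a minimal sketch that enumerates every combination of a few filters. The shop URL, the facet names, and their values are all made up for illustration; a real shop would have far more of each.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode

# Hypothetical facets for an online shop; None means "filter not applied".
FACETS = {
    "category": [None, "adapters", "cables"],
    "manufacturer": [None, "acme", "globex", "initech"],
    "switched": [None, "individual"],
    "sort": [None, "price_asc", "price_desc"],
}


def facet_urls(base, facets):
    """Yield one URL for every combination of facet values."""
    names = list(facets)
    for combo in product(*(facets[name] for name in names)):
        params = [(n, v) for n, v in zip(names, combo) if v is not None]
        yield base + ("?" + urlencode(params) if params else "")


urls = list(facet_urls("https://shop.example/products", FACETS))

# The URL count is the product of the option counts, so each new facet
# multiplies the crawlable space instead of adding to it.
total = prod(len(values) for values in FACETS.values())
print(len(urls), total)
```

Four small facets already give 72 URLs for a single listing page; adding a price filter with ten buckets (plus "not applied") would multiply that by eleven.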
And it can cause lots of problems, the kind that takes down your server, because if you think about it, once a crawler discovers it... And we are only looking at Googlebot, for obvious reasons, because that's our main crawler for Search. We don't have visibility into what Bingbot does, for example, or what other crawlers do. But even for Googlebot, which has close to 30 years of experience crawling the web, once it discovers a set of URLs, it cannot make a decision about whether that URL space is good or not unless it has crawled a large chunk of that URL space. And if you put up a bunch of new URLs, a bunch meaning millions of new URLs that fit into a bunch of different URL patterns, then Googlebot will want to crawl all those URLs to decide whether it should or should not crawl them. And during that time, while it's crawling, it has the potential of rendering the site basically useless for users, because it couldn't yet estimate that the site is under heavy load. It's just crawling a lot of URLs. Of course, once we see the signals that the site is suffering, we back off. But until that happens, we are crawling like madmen to be able to decide whether we should continue crawling these URL patterns or not. Right?
>> Okay. Yeah. So, for instance, how can you determine that you are affected, besides your server going down from all the crawling?
>> I think that's the most severe symptom, that the server is going down. But, for example, on my sites I do live access log analysis, and I get an alert when some crawler ends up in my honeypot. Then I try to figure out whether I want to black-hole them or do something else with that kind of traffic. That is definitely something that people can do, especially if you have a website on a hosting platform like, I don't know, cPanel, for example, and that's probably something that you haven't heard in a million years,
Martin. [laughter]
>> So cPanel is a hosting management platform. It was extremely popular in the 2000s, or the first decade of the 2000s. I don't know how popular it is nowadays, but I'm still using it because it gives me access to a bunch of different things that allow me to look over and access the server that is hosting my websites. Among other things, it allows me to look at my access logs and do different kinds of analysis on them. And there I would immediately see that there is this one particular crawler that's doing something weird on the website. Then I would have to decide what to do with that, right? Because not all crawling is bad.
>> I think we can all agree with that.
>> And you need to make the decision about whether the crawl was good or bad.
>> True.
>> Right? Because you know your website best, hopefully.
>> Hopefully.
>> Yeah. And once you've made that decision, you can decide what to do with that kind of traffic. Let's say you see that Googlebot is accessing this faceted navigation thing on your website, and it's doing it quite aggressively. Then you can decide: well, this is actually good, because it allows the bot to discover new content. In most cases, that will not be true.
>> Yeah.
>> In most cases, we have other ways to discover something new. So you decide that the traffic is bad, and then you look at who the accessor is, and it's Googlebot, and you know that Googlebot follows robots.txt, so you can decide that maybe you want to disallow the paths that Googlebot is crawling right now. Of course, that is not an immediate thing, because robots.txt files are cached for up to 24 hours, or 24 hours-ish. But I think it's still the most reasonable way to handle overcrawling of these spaces. Basically, you come up with a rule that will disallow crawling of your faceted navigation.
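As a rough idea of what such a rule could look like, the sketch below checks a few URLs against two wildcard `Disallow` patterns. The parameter names (`manufacturer`, `sort`) are hypothetical, and the matcher is a small hand-written approximation of robots.txt wildcard matching (`*` and a trailing `$`, longest pattern wins), not Googlebot's actual implementation.

```python
import re

# The rules below correspond to this hypothetical robots.txt fragment:
#   User-agent: *
#   Disallow: /*?*manufacturer=
#   Disallow: /*?*sort=
RULES = [
    ("disallow", "/*?*manufacturer="),
    ("disallow", "/*?*sort="),
]


def rule_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    The longest matching pattern wins; on a tie, 'allow' wins.
    A path that matches no rule is allowed.
    """
    best = ("allow", "")
    for kind, pattern in rules:
        if rule_regex(pattern).match(path):
            longer = len(pattern) > len(best[1])
            tie_allow = len(pattern) == len(best[1]) and kind == "allow"
            if longer or tie_allow:
                best = (kind, pattern)
    return best[0] == "allow"


for path in ["/products", "/products?page=2", "/products?manufacturer=acme"]:
    print(path, is_allowed(path, RULES))
```

One caveat of patterns this loose: `/*?*sort=` would also match a parameter named `resort=`, so it is worth testing rules against real URLs from the access logs before deploying them.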
And then, if you need inspiration for how to do that, google.com/robots.txt actually has examples, not for faceted navigation, but for search parameters: basically, which combinations we want to allow crawling for and which we do not. And you can apply the same thing to your use case as well.
>> Okay. And what other things did we find? Because that was roughly half of it. But we probably have other things that came to light,
>> right? If you had to guess. Don't look at the report. I know that you haven't looked, so don't look at the report. What would you guess the next thing is?
>> Uh, irrelevant parameters, like UTM codes or something like that.
>> Yeah, that's up there. Ha. Up there. But it's not the next biggest thing.
>> Then, uh, status codes? Some weird... No? Okay.
>> Do you want me to save you?
>> Yes. Please save me.
>> I'm your only hope.
>> Yes. General Gary, you're [laughter] my only hope.
>> It's action parameters.
>> Huh. What are action par... What?
>> It is something that we borrowed from security, from web security, a long, long, long time ago.
>> We b... what?
>> So in GET requests, yeah, HTTP GET requests,
>> you can design your website in a way that will make your life miserable.
>> Oh, like action equals save or something like that.
>> Sure.
>> Oh god.
>> But it's not limited to action equals whatever.
>> Okay.
>> It can be literally anything.
>> Yeah. Yeah.
>> Because you can name your parameters whatever.
>> It can be something like update profile equals true or stuff like that.
>> Yeah. Exactly. And if you think back to the early days of the internet, because we are both old enough for that,
>> there was... sure, anytime, any day, any hour. There was an infamous thing going on where you would try to do MySQL injections through the URL parameters, because you realize that login equals username is perhaps not a good idea when you are directly connecting that parameter to your MySQL database.
>> Yeah.
>> Um,
>> or any database, really. Yeah.
>> Drop table.
>> Little Bobby Tables, as XKCD calls it. [laughter]
>> Oh yeah, we should link to the XKCD thing in the podcast description. But yeah, action parameters are making up close to 25% of the reports.
>> Twenty-five what?
>> Yeah.
>> I thought that in times of, what was it called, RESTful APIs and hypermedia as the blah blah of application state and GraphQL and stuff, we wouldn't see these kinds of things. What?
>> Yeah, exactly. That was my reaction as well. And then, if you start digging into what these action parameters are, they are more benign than drop table.
>> Mhm.
>> It's not that bad. But one thing that Googlebot tends not to do is shop around on the internet.
>> Mhm.
>> It will not buy your weirdo hoodie from your website. It doesn't have money in the first place. And second, why would it? We don't just have warehouses where we put stuff that Googlebot might buy. The next big thing was add to wish list.
>> Mhm. Okay.
>> So basically, you add these to links that Googlebot can extract. Here's a product page, and then there's a link to the same product page, like a self link, but it has question mark add to cart equals true or something like that.
>> Okay.
>> Or add to wish list equals true.
>> Wow.
>> And if you add only one of these, like add to cart, that immediately doubles your URL space.
>> Yep.
>> Same for add to wish list.
>> Yep. Great.
>> Add one more, like add to cart ampersand add to wish list, and you have triple.
>> Oh no.
>> So yeah, that's how it ended up being 25%. And we try really quite hard not to push back on these reports, because those who are reporting these issues are already in enough distress.
>> Mhm.
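A small sketch of the effect described here: several crawlable URLs that all lead to the same product page once the action parameters are removed. The parameter names `add_to_cart` and `add_to_wishlist` and the shop URLs are placeholders, not the names any particular plugin uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical action parameter names; real plugins choose their own.
ACTION_PARAMS = {"add_to_cart", "add_to_wishlist"}


def strip_action_params(url):
    """Return `url` with the action parameters removed from its query."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ACTION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Links a crawler might extract from two product views.
crawled = [
    "https://shop.example/hoodie",
    "https://shop.example/hoodie?add_to_cart=true",
    "https://shop.example/hoodie?add_to_wishlist=true",
    "https://shop.example/hoodie?color=red",
    "https://shop.example/hoodie?color=red&add_to_cart=true",
]

distinct_pages = {strip_action_params(url) for url in crawled}
print(len(crawled), "crawlable URLs,", len(distinct_pages), "distinct pages")
```

How large the multiplier gets in practice depends on which parameter combinations the site actually links to.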
>> So we try to dig into where these are coming from, and sometimes you can identify that perhaps these action parameters are coming from a WordPress plugin, because WordPress is quite a popular CMS, content management system. Then you find that yes, these plugins are the ones that add the add to cart and add to wish list. And then what you would do, if you were a Gary, is try to see if they are open source, in the sense that they have a repository where you can report bugs and issues. In both of these cases the answer was yes. So we filed issues against these plugins, and, for example, what I really, really loved is that the good folks at WooCommerce almost immediately picked up the issue and solved it. And the other one, I don't remember which one, the other issue that was coming from a different plugin, as far as I can tell that issue is still sitting there unclaimed. But if we can fix it at scale, then instead of filing some internal bug to try to figure out how to handle these add-to-cart parameters better, we go out on the internet and file an issue against whoever is injecting these into websites.
>> Wow. Do you know how these came to be? Why did they choose this way? There are other ways to do this.
>> Okay. Sure. I mean, it's not in our job description, but it's in our realm to go there and argue with them that this is not the best way to do it. So if you wanted to, you have the links in the report, and you could go there and argue that, hey, how about we use PUT requests or something, because it's really uncommon for Googlebot
>> to do PUT requests.
>> But yeah, I don't know why they chose these ways. They did, and that's what matters
>> for those who are reporting these issues to us.
>> What would you think the next one is? The next issue category.
>> I'm doubling down on, I think, irrelevant parameters like UTM parameters or stuff.
>> Yeah. Okay.
>> That's really quite common. It's like 10% of all the reports. We are really good at handling session IDs and jsessionid and utm_medium and whatever.
>> Mhm.
>> Unless you do something weird on the site.
>> Like what?
>> Like instead of session ID you just use a single s equals.
>> Oh.
>> Because at that point we don't know if that's like
>> True.
>> service equals whatever or
>> Search.
>> search equals something or sentiment equals something. And the values of these parameters often vary quite a bit. It could be just some numeric, well, string, but it can also be some hexadecimal randomness, and we cannot make a decision based on that,
>> because it might be some weird encoding that the site can actually use. So s equals 123456 could just mean that the user is looking for the service whose ID is 123456,
>> or a specific, I don't know, spreadsheet or whatever. We don't know.
>> Mhm.
>> Yeah.
>> Yeah. The point is that we don't know, and then we start crawling like crazy to figure out whether this is changing anything, but we need quite a considerable data set to make that decision accurately.
>> Besides renaming the parameter, is there any way you can avoid that?
>> I mean, session IDs are very 2000s, so you could also just get rid of session IDs, but I think robots.txt would work here as well.
>> Mhm.
>> I think crawlers don't need to see these session IDs, because they don't persist across sessions. They don't have session persistence. So, yeah, just don't.
>> Yeah, just don't. Okay.
>> Okay. Next one.
>> Oh, god. Uh,
>> wait, you had a question. What was the question?
>> Yeah, you can use robots.txt, but do you think this is a documentation problem, or is this something, well,
>> that people just don't know about?
>> I think it's not a documentation problem, because we do have it in the documentation. We have that "URLs that Google can handle" or something
>> Okay.
>> documentation page, and that, as far as I remember, explicitly calls out session IDs.
>> Okay. All right.
>> Or at least it used to. And then I said, ah, we should remove it, because session IDs are so 2000s, but yeah, it is still big. It is sitting in third place. I hate it.
>> Yes, that's quite big.
>> It is what it is.
>> All right. So when we're talking about crawling problems, we are usually talking about crawl space problems, I guess, right? Okay. Hm, what else can blow up crawl space? Soft 404s?
>> Nah. I mean, yes, but it's not in the list.
>> Okay. I only remember... these felt like they were one-off cases. I know that you had this one plugin that you were asking me about, whether we could figure out how to reach out to them, because they added some sort of event widget or something.
>> Oh my god. Yes.
>> That created lots of URLs, but that feels like a kind of one-off thing.
>> It is not. It is 5% of all reports. So basically, if, I don't know, you have a calendar on your site
>> and then you have a page for every single day, and you actually inject something on the page so we cannot detect the soft 404, then we have no way to tell that something is an infinite space. And what you are mentioning, that WordPress plugin, was, and still is, injecting URLs that are completely bogus and basically generating calendar infinite spaces on every single path that it can. So basically,
>> example.com/1 would have an infinite space of these event or calendar dates, /2 would also have an infinite space, and then /1/2 would also have an infinite space, and basically every single path that there is on the site would have its own infinite space.
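The infinite space described here can be pictured as a "next day" link that never runs out. The sketch below uses a made-up `event_date` parameter (the plugin's real URL format is not given in the episode) and takes only the first three URLs per path, because the generator itself never ends.

```python
from datetime import date, timedelta
from itertools import islice


def calendar_urls(base_path, start):
    """Yield an endless stream of per-day URLs for one path.

    `event_date` is a hypothetical parameter name used for illustration.
    """
    day = start
    while True:
        yield f"{base_path}?event_date={day.isoformat()}"
        day += timedelta(days=1)


# Every path on the site gets its own unbounded calendar.
paths = ["/1", "/2", "/1/2"]
sample = {
    path: list(islice(calendar_urls(path, date(2025, 1, 1)), 3))
    for path in paths
}

for path, urls in sample.items():
    print(path, urls)
```

If each of those pages returns a normal-looking response instead of an error, nothing in the responses themselves signals where the useful content stops.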
So it can be really bad, and again, figuring out a robots.txt disallow rule would be the most immediate and cleanest way to handle it, unless you can hunt down the developer of the plugin and convince them to change their ways, which in this case we couldn't.
>> Basically, we tried to reach out a number of times, and everything fell on deaf ears.
>> Oh, that's unfortunate.
>> It is what it is. That's internet life.
>> Is the plugin open source? Can we fix it on...
>> No, it's a commercial thing. So, we can't even... like, it's open source because WordPress needs it to be open source, but otherwise it's a commercial thing.
>> Okay. Okay. Dang it.
>> Yeah. And then finally, we have just the weird stuff of the internet, sitting at like 2%, I think, or something like that. It's basically, I don't know, like if you double percent-encode a URL accidentally.
>> Oh. Oh, but that's nasty. That happens so quickly if you're not careful.
>> Yeah. Basically, you do your due diligence and you percent-encode something on your website, but then some other plugin, or whatever interacts with that link, re-encodes the already encoded link or URL. And then you end up with something that we cannot handle, because yes, we percent-decode the URL that we extract, but then we are still left with a percent-encoded
>> URL, because it was double encoded.
>> Yeah.
>> And then we try to crawl those, and your website cannot handle them, and it will either throw weird errors, which we will notice and be smart about, or, if it's just showing us random content, then basically we are just going to be happy to crawl those bogus URLs.
>> And this problem is so easy to create, because if you're not careful as a developer, you might be like, "Oh, I think we always encoded when we were rendering the data, not when we put it in a database."
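The double-encoding mistake is easy to reproduce with Python's standard library. In this sketch the path is an invented example, and `decode_until_stable` is a hypothetical cleanup helper, not something mentioned in the episode.

```python
from urllib.parse import quote, unquote

original = "/search/blue hoodie & socks"

# First pass: the space and '&' are percent-encoded as intended.
once = quote(original)

# Second pass by some other layer: every '%' itself becomes '%25'.
twice = quote(once)

print(once)
print(twice)

# A single decode of the double-encoded URL only gets back to the
# encoded form, which is what a crawler would then try to fetch.
assert unquote(twice) == once
assert unquote(twice) != original


def decode_until_stable(value, max_rounds=5):
    """Hypothetical helper: decode repeatedly until nothing changes.

    This can over-decode data that legitimately contains a literal '%25',
    so it suits auditing stored URLs more than blind rewriting.
    """
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value
```

A telltale sign of the problem in logs or stored data is `%25` followed by two hex digits, such as `%2520` where `%20` was meant.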
And then someone else joins the team and they're like, "Oh, we are URL encoding right when we put it in the database." And then, ah, you end up with a mess, because you fix the problem like two months in, and then you have a lot of content that is double encoded, but a bunch of it is not, and it's hard to catch and hard to fix.
>> Yeah.
>> Ah, that's annoying.
>> Anyway, that was it. That was the report.
>> Wow. Okay. My mind is still blown by faceted navigation being such a prominent...
>> I mean, if you think about it, it makes sense, I think.
>> Yeah, it does. But yeah,
>> commerce is quite big on the internet nowadays. So having that as the bulk of the reports, to me it makes sense.
>> It is unfortunate that it is still a problem. I think we put up a blog post about it a couple of years ago. Perhaps we can link to it in the description
>> Yes.
>> of the podcast episode. But yeah, it's still a problem. I think it's also a problem because some of these platforms don't let people fix these issues themselves.
>> Yeah. Especially if you don't have access to robots.txt, that is tricky, I guess. Yeah.
>> Yep.
>> Unfortunate. The action parameters: first things first, I now have a name for these things. And second, they are, what were they, like 20%, 24%, something like that?
>> 24, 25. Yeah.
>> That's wild. That's a surprise. Interesting. I do hope that our listeners out there got something from this. I certainly did. That was wild. And thank you so much for taking the effort to dig through the bugs, having a look at this, and compiling this report. That's really, really cool. And thanks so much for taking the time to talk to me today.
>> You mean the report that you haven't looked at?
>> I have a lot of things to... Yes.
>> Thank you.
>> Okay. Fine. Thank you. Thank you so much. And to everyone listening out there, thanks a lot for joining us as well. I hope you like this episode. Let us know in the comments below.
And do subscribe and like and stay in touch with us, please. We're really looking forward to hearing your thoughts on this kind of topic.
>> Martin does. I don't.
>> I do. Yeah, I do care. Um,
>> it's okay that you don't. I'm taking that.
>> Yeah, you said we.
>> Okay, fine. I care. I'm sorry. Anyhow, I say thank you again, and have a great time. Take care, au revoir, and goodbye. Adieu.
>> We've been having fun with these podcast episodes, and we hope that you, the listener, have found them both entertaining and insightful, too. Feel free to drop us a note on LinkedIn or chat with us at one of the next events that we go to if you have any thoughts. And of course, don't forget to like and subscribe. [music] Thank you and goodbye.