Crawling Challenges: What the 2025 Year-End Report Tells Us.

2026-02-03 · en automatic

[music]
Hello, bonjour, welcome, grüezi, to another
episode of Search Off the Record, our
podcast. My name is Martin Splitt. I am
from the Search Relations team and with
me today is Gary Illyes, also from the
Search Relations team. Hi, Gary.
>> Hello.
>> Did I pronounce your last name
right?
>> No.
>> Ah, I tried so hard and I still
cannot do it properly.
>> Yeah. Yeah. Yeah. It's Hungarian, so
don't blame yourself. It's a horrible,
horrible language.
>> No, it's a wonderful language. Have you
tried German?
>> Oh, yeah.
>> Yeah. [laughter]
Yeah. I have so many problems with
High German and with the der, die, das.
That's my pet peeve. I think I have a
decent vocabulary, but with the der, die,
das, and also if you have to, like, make
it accusative and whatever,
>> then it it becomes even more
complicated.
>> I just can't. And I'm so
grateful that the Swiss realized this,
the Swiss Germans or the
German-speaking Swiss realized this and
they just got rid of it.
>> Yeah. Because in Swiss German you don't
have der, die, das, you have d'.
>> Yeah.
>> And I love it. But if I go to Berlin for
example or Frankfurt and then I I don't
know I have to say something and I say
it in Swiss German then they are just
blinking at me
>> and then I would have to say whatever I
said in high German and then they would
correct me.
>> The Swiss do use the articles but they
use it differently. So um in high German
it's dram and here it's dust and I think
that's also confusing anyway it doesn't
matter but in speaking when you are
speaking you are not pronouncing it
fully
>> that's true in in Swiss German at least
in you're not yeah that's true
>> yeah it's just the or something like
that
>> you can also use s for everything
>> okay fair enough Perfect. Perfect.
>> Great, isn't it?
>> Yeah.
>> So, what do you want to talk about?
>> I want to talk about the things that
aren't perfect, because I know that you
have had a look at crawling
throughout the year and I'm just
curious: what are the things that you
found? Are we doing well? Are we doing
not so well? Give me, like,
the 2025 Wrapped in crawling.
>> Well, did you read my report that I sent
to the team?
>> I did a cursory reading, yes.
>> Okay, that's what I was expecting from
you. I can't expect any more.
>> But there was one category that that
stuck out to me.
>> So to give you, the listener, some
background: our team handles that "report
a crawl issue" form.
>> And basically when someone submits that
form and the form input is
validated, then it would end up in our
inbox. And once it ends up in our inbox,
then depending on what's in the form we
would take some sort of action. But the
first thing that we need to do is to
validate whether there is an issue or
not. And then when you are validating
the issue, you can categorize the
issue into several categories. One of
the categories is that there is no issue,
but then there are one, two, three, four,
five buckets where we can put the issue,
and then the team who's
basically ensuring crawl quality
would do different things based on
how we categorize the
issue.
>> Mhm.
>> So the buckets that we have, or that
I named internally (I'm reading the
report right now, so you're getting the
actual download): the first is faceted
navigation. The second issue category is
action parameters. The third one is
irrelevant parameters. Then we have
calendar parameters, or otherwise event
dates. And then finally, we have
basically an "other" category where we
would put stuff that doesn't fit
anywhere else. And this is the smallest
one, because the vast majority of the
reports can be categorized into these
buckets.
>> Mhm.
>> Or in the previously mentioned buckets.
>> So what did we find? I really like the
report. I just think there are things
that we should probably make available
to the larger audience.
>> Like what? Not my coffee.
>> Not your coffee. But like the things
that we saw and that we found and I know
that some of these buckets are
substantially larger than other buckets.
>> Yeah.
>> And they are implementation dependent,
right?
>> Yes.
>> So,
>> So a large chunk of the issues that we
looked at is related to faceted
navigation.
>> That's fascinating because I
keep seeing this discussed on Reddit and
on social media and at conferences and
stuff and I don't think it got that much
attention and seeing that this is such a
large percentage of the things that we
looked at.
>> Yeah.
>> Is interesting.
>> It's close to 50% of the total reports
that we got.
>> Mhm.
>> Which says a lot. I think
>> should we explain what it is?
>> You do it.
>> So if you have a website that
allows filtering and sorting through
various dimensions or options. So for
instance you have an online shop and you
allow me... Digitec is great. For instance,
I needed a multi-socket adapter where I
can plug in multiple things into
one power socket, and I wanted them to be
individually switched so I can
switch them on or off individually. So
it had an option to filter for
multi-socket adapters with individual
switches. These kinds of things tend to
end up giving you a large number of
combinations if you have a bunch of
them. So you can filter by price, by
category, by manufacturer, by whatever
kind of details the product might
have. And that creates a URL that shows
the products you have in store that fit
this kind of combination. But because
they are combinations,
you can end up with lots and lots of
URLs with different variations of the
individual settings. Right? Is that
roughly summing it up?
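
To get a feel for how quickly filter combinations multiply, here is a minimal Python sketch. The facet names, their values, and the `/adapters` path are invented for illustration and do not come from any real shop.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical facets for one category page; None means "filter not set".
FACETS = {
    "price": [None, "0-20", "20-50", "50-100"],
    "brand": [None, "acme", "globex", "initech"],
    "switched": [None, "yes", "no"],
    "sort": [None, "price_asc", "price_desc", "rating"],
}

def facet_urls(base_path, facets):
    """Yield one URL per combination of facet values."""
    names = list(facets)
    for combo in product(*(facets[n] for n in names)):
        params = [(n, v) for n, v in zip(names, combo) if v is not None]
        yield base_path + ("?" + urlencode(params) if params else "")

urls = list(facet_urls("/adapters", FACETS))
print(len(urls))  # 4 * 4 * 3 * 4 = 192 URLs for a single category page
```

Four small filters already turn one page into 192 crawlable URLs, and every additional facet multiplies that number again.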
>> Right. And for the listener,
Digitec, or Galaxus, is the Swiss
version of basically Amazon.
>> True. Sorry for that. Yes.
>> There is no Amazon in Switzerland, but
yeah, that's a good summary.
And it can cause lots of problems, like
the "takes down your server"
kind of issue, because if you think about
it, once a crawler discovers it, and we
are only looking at Googlebot for
obvious reasons, because that's our main
crawler for search. We don't have
visibility into what Bingbot does, for
example, or other crawlers do. But even
for Googlebot that has close to 30 years
of experience crawling the web, once it
discovers a set of URLs, it cannot make
a decision about whether that URL space
is good or not unless it crawled a large
chunk of that URL space. And if you put
up a bunch of new URLs, a bunch meaning
millions of new URLs that fit into a
bunch of different URL patterns, then
Googlebot will want to crawl all those
URLs to make a decision whether it
should crawl or should not crawl those
URLs. And in that time while it's
crawling it has the potential of
rendering the site basically useless for
users because it couldn't yet estimate
that the site is under heavy load. It's
just crawling a lot of URLs and then of
course once we see the signals that the
site is suffering we would back off. But
until that happens, we are just crawling
like madmen to be able to decide whether
we should crawl something or continue
crawling these URL patterns or not.
Right.
>> Okay. Yeah. So like for instance, how
can you determine that you are affected
besides your server going down from all
the crawling?
>> I think that's the most severe symptom
that the server is going down. But for
example on my sites I do live access log
analysis and then I would get an alert
when some crawler ended up in my
honeypot and then I would try to like
figure out whether I want to black hole
them or do something with that kind of
traffic and that is definitely something
that people can do especially if you
have a a website that has a hosting
platform like I don't know cPanel for
example and that's probably something
that you haven't heard in a million
years. Martin, [laughter]
>> So cPanel is a hosting management
platform. It was extremely popular in
the 2000s, or the first decade of the
2000s. I don't know how popular it is
nowadays, but I'm still using it because
it's giving me access to a bunch of
different things that allow me to look
over the server and access the server
that is hosting my websites. And among
other things, it allows me to look at my
access logs and do different kinds of
analysis on my access logs. And there
I would immediately see that there
is this one particular crawler that's
doing something weird on on the website.
And then I would have to decide what to
do with that, right? Because not all
crawl is bad.
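
As a rough sketch of that kind of access log analysis, the Python below counts requests to parameterized URLs per user agent. It assumes the common "combined" log format, and the sample lines and the `ExampleBot` agent are invented. It is not the setup described in the episode.

```python
import re
from collections import Counter

# Matches the request, status, and user agent of a combined-log-format line.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def parameterized_hits(lines):
    """Count requests to URLs with a query string, grouped by user agent."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and "?" in m.group("path"):
            hits[m.group("agent")] += 1
    return hits

# Invented sample lines using documentation IP ranges.
SAMPLE = [
    '203.0.113.5 - - [03/Feb/2026:10:00:00 +0000] "GET /shop?color=red&size=m HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [03/Feb/2026:10:00:01 +0000] "GET /shop?color=red&size=l HTTP/1.1" 200 498 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [03/Feb/2026:10:00:02 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

for agent, count in parameterized_hits(SAMPLE).most_common():
    print(agent, count)
```

A user agent string can be spoofed, so a count like this is only a first hint about who is crawling, not proof.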
>> I think we can all agree with that. And
you need to make the decision about
whether the crawl was good or bad. True,
>> right? because you know your website
best hopefully.
>> Hope hopefully.
>> Yeah. And once you made that decision,
then you can decide what to do with that
kind of traffic. Let's say that you see
that Googlebot is accessing this uh
faceted navigation thing on your website
and it's doing it quite aggressively.
Then you can decide well this is
actually good because it allows the bot
to discover new content. In most of the
cases that will not be true.
>> Yeah. In most of the cases, we would
have other ways to
discover something new. So you
decide that the traffic is bad, and then
you look at who the accessor is, and
then it's Googlebot, and then you know
that Googlebot is following
robots.txt, and then you can decide that
maybe I want to disallow these paths that
Googlebot is crawling right now. And of
course that is not an immediate thing,
because robots.txt files are cached for
up to 24 hours, or 24 hours-ish.
But it's still, I think, the most
reasonable way to handle crawling
of these URL spaces. Basically, you come
up with a rule that will disallow
crawling of your faceted navigation. And
then if you need inspiration for how to
do that, google.com/robots.txt
actually has examples for, not faceted
navigation, but search parameters:
basically what kind of combinations we
want to allow crawling and what
combinations we do not. And you can
apply that same thing to your use case
as well.
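
A sketch of what such disallow rules might look like, with a small Python matcher to try them against sample paths. The `color` and `sort` parameter names are hypothetical, and the matcher is a simplified stand-in for the `*` and `$` wildcards. It ignores user-agent groups and Allow lines and is not Googlebot's actual implementation.

```python
import re

# Hypothetical rules for a shop whose facets live in query parameters.
ROBOTS_TXT = """\
User-agent: *
Disallow: /*?*color=
Disallow: /*?*sort=
"""

def disallow_patterns(robots_txt):
    """Collect Disallow patterns (this sketch ignores groups and Allow lines)."""
    patterns = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            patterns.append(value.strip())
    return patterns

def pattern_to_regex(pattern):
    """Translate '*' (any characters) and a trailing '$' (end of URL) to a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_disallowed(path, robots_txt=ROBOTS_TXT):
    """True if any Disallow pattern matches the path from its start."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns(robots_txt))

print(is_disallowed("/shoes?color=red&size=42"))   # True
print(is_disallowed("/shoes?size=42&sort=price"))  # True
print(is_disallowed("/shoes"))                     # False
```

Patterns like these match substrings, so `color=` would also catch a parameter named `bgcolor=`. It is worth testing rules against real URLs from the logs before deploying them, and as mentioned above, a change only takes effect once the cached robots.txt is refreshed.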
>> Okay. And what other things did we find?
Because that was roughly half of it. But
we probably have other things that came
to light,
>> right? Like if you had to guess, don't
look at the report. I know that you
haven't looked. So don't don't look at
the report. What would you guess the
next thing is?
>> Uh
irrelevant parameters like UTM codes or
something like that.
>> Yeah, that's up there. But
it's not the next biggest thing.
>> Then, uh,
status codes? Something weird? No. Okay.
>> Do you want me to save you?
>> Yes. Please save me.
>> I'm your only hope.
>> Yes, General Gary, you're [laughter]
my only hope.
>> It's action parameters.
>> Huh. What are action par... what?
>> It is something that we borrowed from
security like web security a long long
long time ago.
>> We b... what?
>> So in GET requests, yeah,
HTTP GET requests,
>> You can design your website in a way
that will make your life miserable.
>> Oh, like action equals save or something
like that.
>> Sure.
>> Oh god.
>> But it doesn't it's not limited to
action equals whatever.
>> Okay.
>> It can be literally anything.
>> Yeah. Yeah.
>> Because you can name your parameters
whatever.
>> It it can be something like update
profile equals true or stuff like that.
>> Yeah. Exactly. And then if you think
back to the early days of the internet,
because we are both old enough for that
>> Sure. Anytime, any day, any
hour.
>> there was an infamous thing
going on where you would try to do MySQL
injections
>> through the URL parameters, because you
realize that login equals username
perhaps is not a good idea when you are
directly connecting that parameter to
your MySQL database.
>> Yeah.
>> Um
>> or any database really. Yeah.
>> Drop table.
>> Little Bobby Tables, as XKCD calls it.
[laughter]
>> Oh yeah, we should link to the XKCD
thing in the podcast description. But
yeah, action parameters they are making
up close to 25% of the of the reports.
>> 25... what?
>> Yeah.
>> I thought in times of, what was it
called, RESTful APIs and hypermedia as
the blah blah of application state and
GraphQL and stuff, we wouldn't see these
kinds of things. What?
>> Yeah, exactly. And that was my reaction
as well. And then if you start digging
into it like what are these action
parameters, they are more benign than
drop table.
>> Mhm.
>> It's not that bad. But the thing that
Googlebot tends not to do is to shop
around on the internet.
>> Mhm.
>> It will not buy your weirdo hoodie from
your website. It doesn't have money in
the first place. And second, why would
it? We don't just have
warehouses where we put stuff that
Googlebot might buy. The next big thing
was the add to wish list.
>> Mhm. Okay.
>> So basically, you add these to links
that Googlebot can extract. So
basically here's a product page and then
there's a link to the same product page,
like a self link, but it has like
question mark, add to cart equals true or
something like that.
>> Okay.
>> Or add to wish list equals true.
>> Wow.
>> And then if you add just one of
these, like add to cart, that
immediately doubles your URL space.
>> Yep.
>> Same for add to wish list.
>> Yep. Great. Add one more, like you could
do add to cart, ampersand, add to
wish list, and you have triple.
>> Oh no.
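
To make the arithmetic concrete, here is a small Python sketch. The parameter names `add_to_cart` and `add_to_wishlist` and the `/product/...` paths are hypothetical. Real plugins may name and combine them differently.

```python
# Hypothetical action parameters that a plugin might append to product links.
ACTION_PARAMS = ["add_to_cart", "add_to_wishlist"]

def url_variants(product_url, action_params):
    """Return the plain URL plus one variant per action parameter."""
    return [product_url] + [f"{product_url}?{p}=true" for p in action_params]

products = [f"/product/{i}" for i in range(1, 1001)]

one_param = [u for p in products for u in url_variants(p, ACTION_PARAMS[:1])]
two_params = [u for p in products for u in url_variants(p, ACTION_PARAMS)]

print(len(products), len(one_param), len(two_params))  # 1000 2000 3000
```

If links that carry both parameters at once also exist, each product would have four URLs instead of three.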
>> So yeah, that's how it ended up being
25%. And then, I mean, we try really
quite hard not to push back on these
reports because those who are
reporting these issues are already in
enough distress.
>> Mhm.
>> So we would try to dig into where
these are coming from, and then sometimes
you can identify that perhaps these
action parameters are coming from a
WordPress plug-in, because WordPress is
quite a popular CMS, content
management system, and then you would
find that yes, these plugins are the ones
that add the add to cart and add to wish
list. And then what you would do, if you
were a Gary, is to try to see if they are
open source in the sense that they have
a repository where you can report bugs
and issues, and in both of these cases
the answer was yes. So we would file
issues against these plugins, and then,
for example, what I really, really loved
is that the good folks at WooCommerce
almost immediately picked up the issue
and they solved it. And then the other
one, I don't remember which one, the
other issue that was coming from a
different plug-in, as far as I can
tell, that issue is still sitting
there unclaimed.
But
>> if we can fix it at scale, then instead
of filing some internal bug to like try
to figure out how to handle these add to
cart parameters better, we would go out
on the internet and then try to file
an issue against whoever is injecting
these into websites.
>> Wow. Do you know how how these came to
be? Is it like why did they choose this
way? There there are other ways to do
this. Okay. Sure. I mean it's in our not
job job description but in our realm to
like go there and argue with them that
like this is not the best way to do it.
So if you
>> like if you wanted to, then you
could, like, you have the links in the
report and you could go there and argue
that hey, how about we use PUT requests
or something, because it's really
uncommon for Googlebot to
>> to do PUT requests.
>> But yeah, I don't know why they
chose these ways. They did, and that's
what matters
>> for those who are reporting these issues
to us.
>> What would you think the next one is?
The next issue category.
>> I'm I'm doubling down on I think
irrelevant parameters like UTM
parameters or stuff. Yeah. Okay.
>> That's really quite common. It's like
10% of all the reports. We are really
good at handling session IDs and
jsessionid and utm_medium and whatever.
>> Mhm.
>> Unless you do something weird on
the site.
>> Like what?
>> Like instead of session ID
you just use, like, a single s equals.
>> oh
>> because at that point we we don't know
if that's like
>> true
>> service equals whatever or
>> search
>> search equals something or sentiment
equals something, and the values of these
parameters often vary quite a bit. Like
it could be just some numeric, well,
string, but it can also be some
hexadecimal randomness, but we cannot make a
decision based on that
>> because it might be some weird encoding
that the the site can actually use. So s
equals 1 2 3 4 5 6 could just mean that
the user is uh looking for the service
whose ID is 1 2 3 4 5 6
>> or a specific I don't know spreadsheet
or whatever like we don't know. Mhm.
>> Yeah.
>> Yeah. The point is that we don't know
and then we start crawling like crazy to
figure out is this changing anything but
then we need quite a considerable data
set to make that decision accurately.
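
A minimal sketch of why well-known names are easy to handle and a bare `s` is not. The set of "known irrelevant" parameter names below is an assumption made for this example, not a list Google publishes, and the function is not how Googlebot works.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameter names assumed (for this sketch only) never to change page content.
KNOWN_IRRELEVANT = {"sessionid", "jsessionid", "utm_source", "utm_medium", "utm_campaign"}

def strip_irrelevant(url):
    """Drop well-known tracking and session parameters, keep everything else."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in KNOWN_IRRELEVANT]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_irrelevant("https://example.com/p?id=7&utm_medium=email&sessionid=abc"))
# https://example.com/p?id=7
print(strip_irrelevant("https://example.com/p?id=7&s=123456"))
# https://example.com/p?id=7&s=123456  (unclear what "s" means, so it stays)
```

With a recognizable name the parameter can be ignored up front. With `s=123456` nothing in the URL says whether it is a session, a search, or a service ID, so the only way to find out is to fetch many variants and compare.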
>> Besides renaming the parameter, is there
any way you can avoid that?
>> I mean, session IDs are very 2000s, so
you could also just get rid of session
IDs, but I think robots.txt would work
here as well.
>> Mhm.
>> I think crawlers don't need to see these
session IDs because they don't persist
across sessions. They don't have session
persistence. So, yeah, just don't.
>> Yeah, just don't. Okay.
>> Okay. Next one.
>> Oh, god. Uh,
>> Wait, you had a question. What was the
question?
>> Yeah, you can use
robots.txt, but do you think this is a
documentation problem or is this
something, well,
>> that people just don't know about?
>> I think not a documentation problem,
because we do have it in the
documentation. Like, we have that "URLs
that Google can handle" or something
>> Okay.
>> documentation page, and that, as far as I
remember, explicitly calls out session
IDs.
>> Okay. All right.
>> Or at least used to. And then I said
that, ah, we should remove it because
session IDs are so 2000s, but yeah, it is
still big. It is sitting in third
place. I hate it.
>> Yes that's quite big.
>> It is what it is.
>> All right. So when we're
talking crawling problems, we are usually
talking about the crawl space
problems, I guess, right? Okay. Hm, what
else can blow up crawl space?
Soft 404s?
>> Nah. Ah, I mean, yes, but
it's not in the list.
>> Okay. I only remember, like, these felt
like they were one-off cases. I know
that you had this one plugin that you
were asking me about, like if we can
figure out how to reach out to them
because they added some sort of event
widget or something.
>> Oh my god. Yes.
>> That created like lots of URLs, but that
feels like a kind of one-off thing.
>> It is not. It is 5% of all reports.
So basically if, I don't know, you
have a calendar on your site
>> and then you have a page for every
single day, and then you would actually
inject something on the page so we
cannot detect the soft 404, then we have
no way to tell that something is an
infinite space. And then what you are
mentioning, that WordPress plug-in, was,
still is, injecting URLs that are
completely bogus and basically
generating calendar infinite spaces on
every single path that they can. So
basically
>> example.com/1
would have an infinite space of
these event or calendar dates, /2
would also have an infinite space, and
then /1/2 would also have an
infinite space, and basically literally
every single path that there is on
the site would have its own infinite
space. So it can be really bad, and again,
figuring out a robots.txt disallow
rule would be the most immediate and
cleanest way to handle it, unless you can
hunt down the developer of the plug-in
and convince them to change their ways,
which in this case we couldn't.
>> Basically we tried to reach out a number
of times and everything fell on deaf
ears.
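
A rough Python sketch of why a calendar widget on every path is so costly. The `/events/<date>` URL shape is a made-up stand-in, since the episode does not say what the plugin's URLs actually look like.

```python
from datetime import date, timedelta
from itertools import islice

def calendar_urls(path, start):
    """Yield one day-page URL after another; nothing on the site says "stop"."""
    day = start
    while True:
        yield f"{path.rstrip('/')}/events/{day.isoformat()}"
        day += timedelta(days=1)

# Hypothetical site paths; the widget is attached to every one of them.
PATHS = ["/1", "/2", "/1/2"]

for path in PATHS:
    # Only peek at the first three days; the generator itself never finishes
    # in practice (Python's date type only runs out at the year 9999).
    print(list(islice(calendar_urls(path, date(2025, 1, 1)), 3)))
```

Each path gets its own endless run of day pages, so the number of crawlable URLs grows with both the number of paths and the number of days a crawler is willing to follow.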
>> Oh that's unfortunate.
>> It is what it is. That's internet life.
>> Is the plug-in open source? Can we like
fix it on?
>> No, it's a commercial thing. So, we
can't even like open source because
WordPress needs it to be open source,
but otherwise it's a commercial thing.
>> Okay. Okay. Okay. Okay. Dang it.
>> Yeah. And then finally, we have just the
weird stuff of the internet sitting
at like 2%, I think, or something like
that. It's basically like, I don't know,
if you double percent-encode a
URL accidentally.
>> Oh. Oh, but that's nasty. That happens so
quickly if you're not careful.
>> Yeah. And it's basically you do your due
diligence and then you percent-encode
something on your website, but then some
other plugin or whatever, something that
interacts with that link, would re-encode
it, the already encoded link or URL. And
then you end up with something that we
cannot handle, because yes, we
percent-decode the URL that we extract,
but then we are still left with a
percent-encoded
>> URL because it was double encoded.
>> Yeah.
>> and then we try to crawl those and then
your website cannot handle them and then
it will either throw weird errors that
we will notice and we are going to be
smart about it. But if it's just like
showing us random content, then
basically we are just going to be happy
to crawl those bogus URLs.
>> And this this problem is so easy to
create because if you're not careful as
a developer, you might be like, "Oh, uh
I think we always encoded when we were
like rendering the data, not when we put
it in a database." And then someone else
joins the team and they're like, "Oh, we
are URL encoding right when we put it in
the database." And then ah and then you
end up with a mess because you fix the
problem like two months in and then you
have like a lot of content that is
double encoded but a bunch of it is not
and uh it's hard to catch and hard to
fix.
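
The double-encoding problem is easy to reproduce with Python's standard library. This is only a demonstration of the mechanism with a made-up path. The detection helper at the end is a heuristic, not a reliable test.

```python
import re
from urllib.parse import quote, unquote

original = "/search/blue shoes"

encoded_once = quote(original)       # the space becomes %20
encoded_twice = quote(encoded_once)  # the "%" itself becomes %25

# One round of decoding only peels off the outer layer.
decoded_once = unquote(encoded_twice)

print(encoded_once)   # /search/blue%20shoes
print(encoded_twice)  # /search/blue%2520shoes
print(decoded_once)   # /search/blue%20shoes (still percent-encoded)

def looks_double_encoded(url):
    """Heuristic: '%25' followed by two hex digits suggests a re-encoded '%'."""
    return re.search(r"%25[0-9A-Fa-f]{2}", url) is not None

print(looks_double_encoded(encoded_twice))  # True
```

The heuristic can misfire on content that legitimately contains a literal `%20`, so it is better suited to flagging URLs for review than to rewriting them automatically.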
>> Yeah.
>> Ah that's annoying.
>> Anyway, that was it. That was the
report.
>> Wow. Okay. I'm I'm still mind blown with
the faceted navigation being such a
prominent uh
>> I mean if you think about it makes sense
I think.
>> Yeah it does. But yeah,
>> commerce is is quite big on the internet
nowadays. So having that as the bulk of
the reports to to me it makes sense.
>> It is unfortunate that it is still a
problem. I think we put up a blog post
about it a couple years ago. Perhaps we
can link to it in the uh description
>> yes
>> of the podcast episode. But yeah, it's
still a problem. I think it's also a
problem because some of these platforms
don't let people fix these issues
themselves.
>> Yeah. Yeah. Especially if you don't have
access to robots.txt, that is tricky, I
guess. Yeah.
>> Yep.
>> Unfortunate. The action parameters:
first things first, I now have a name
for these things. And the second thing,
that they are, what were they, like 20%,
24%, something like that?
>> 24, 25. Yeah.
>> That's wild. That's a surprise.
Interesting. I do hope that our
listeners out there got something from
this. I certainly did. That was wild.
And uh thank you so much for taking the
effort to dig through the bugs and uh
having a look at this and compiling this
report. That's really really cool. And
thanks so much for taking the time to
talk to me today.
>> You mean the report that you haven't
looked at?
>> I have a lot of things to Yes.
>> Thank you.
>> Okay. Fine. Thank you. Thank you so
much. And um to everyone listening out
there, thanks a lot for joining us as
well. And uh I hope you like this
episode. Let us know in the comments
below. And do subscribe and like and uh
stay in touch with us, please. We're
really looking forward to hearing
your thoughts on this kind of topic.
>> Martin does. I don't
>> I do. Yeah, I do care. Um
>> it's okay that you don't. I I'm I'm
taking that.
>> Yeah, you said we.
>> Okay, fine. I care. I'm sorry. Anyhow, I
say thank you again and have a great
time. Take care. Au revoir and goodbye.
I do.
We've been having fun with these podcast
episodes and we hope that you, the
listener, have found them both
entertaining and insightful, too. Feel
free to drop us a note on LinkedIn or
chat with us at one of the next events
that we go to if you have any thoughts.
And of course, don't forget to like and
subscribe. [music]
Thank you and goodbye.