Handling Dupes - Same Same or Different? | Search Off the Record

2024-12-05 · en automatic
[Music]
hello and welcome to another episode of
search off the record a podcast coming
to you from the Google search team
discussing all things search and having
some fun along the way my name is Martin
and I'm joined today by John from the
search relations team of which I'm also
part of hi John hi Martin and we have a
special guest Alan Scott from the dups
team hi Alan
dups dups dups dups dups internally we
call it dupes but
okay oh I'm not a I'm not a native
English speaker for me it's dups okay so
you're like you're actually right we
spell it wrong we we put dups and so
everyone outside should think it's dups
but for some reason we always called it
dupes but but I think externally we call
it canonicalization which is even worse it's
it's yeah it's fantastic isn't it I've
been fighting that terminology for years
oh oh really let okay before we get into
that would you be so kind to introduce
yourself to our audience of course of
course so my name is Alan Scott I am a
software engineer at Google I've
been here over 12 years now I think and
uh I have spent almost all that time
working on the problem of duplicate
detection and elimination which uh wraps
into other friend problems like signal
forwarding and these days even starts
pulling in other Wilder topics uh from
the fringes like error pages and
localization so uh yeah oh wow all right
so we've we've started this off with me
mispronouncing
dupes and you telling us that when
externally we talk about
canonicalization you really don't like
that why don't you like the term
canonicalization to be fair that's
something that I go up against usually
more internally cuz uh when people think
canonicalization they sort of Imagine
This one black box that does all the
magic things together and uh it's very
difficult to handle requests from people
that are like well why is
canonicalization wrong and and so um I I
tend to push people to think of it as
well canonicalization is one step it's I
have a bunch of URLs and I want to know
which of them is the canonical but there
are other steps that are as if not more
important here like the the first one
being clustering oh usually when people
come to us and complain about
canonicalization the immediate thing we
say is oh that's a clustering problem
because these two pages shouldn't be in
the same cluster let alone cases of
canonical selection like if you want to
bring a canonicalization problem to me
what that is is these two pages are in
the same cluster but they aren't
actually like we picked the wrong one
like the most dire case being a
hijacking uh we see those and we act
really fast cuz those are just disasters
so so clustering is basically taking the
pages that we think are the same and
then canonicalization is from those
pages which one is the best one is that
about right exactly yes okay yeah so for
example rel canonical is a bit um bit
of a magic factor that crosses both
these lines rel canonical will actually
it will first try to put two pages in
the same cluster it may or may not
succeed but if two pages are in the same
cluster and there is a rel canonical
between them then it's also a canonical
selection signal oh so you say it's a
canonical selection signal does that
mean that there's other things that
could be a signal for canonicalization
uh I'm not sure what the exact number is
right now because it goes up and down
but I suspect it's somewhere in the
neighborhood of 40 whoa okay well now
our listeners will be making
spreadsheets with 40 signals
like like they used to do with those 200
ranking signals that we had but I I
think if I remember correctly HTTP versus
HTTPS is one of them yes uh there's
actually multiple criteria that try to
deal with that dimension in specific cuz
we want to get that right but um it's
not as easy as it might seem the general
guiding principle we have is we want to
sort of what you see is what you get for
the end user where if we give them an
HTTPS page then it should actually
be secure whereas if we don't think it's
secure they should get an HTTP page um
that means that sometimes we follow the
webmaster signals and sometimes we
don't because webmasters might do
things like hey my HTTPS page redirects
to my HTTP page and then to a different
HTTPS page that's not secure so that
will get you pushed to an HTTP canonical
if we can manage it interesting I I
guess the the issue of multi steps of
redirect that's that's challenging in
general
right yeah it's like finding which which
one is the right one to to show or which
one is maybe something tied to
personalization or the the location of
the user it's funny actually uh this all
kind of links together here uh because
we just came off HTTP versus HTTPS and now
we're talking redirects just recently uh
I I made an effort to sack one of the
criteria and I'll give away the name it
was called redirect to shorter and it
had a really bad interaction with HTTP
HTTPS because if you had conflicting
signals come from the webmaster this
one would push you to HTTP oh so we
wanted to get rid of it for the longest
time oh that extra letter yeah literally
just that extra letter it's this is why
I like go ahead make your spreadsheets
some of these criteria are not very
smart some of them are very tricky but
some of them are also very very basic
heuristics oh my okay wow but why do we even
need like 40 plus minus X signals I mean
website owners never make mistakes and
give you the correct canonical all the
time right so when it comes to trying to
figure out how to weight things one of
our biggest problems is we don't know
what to do when a webmaster sends us
conflicting signals um the two most
common that come up there would be 301
versus rel canonical um like those are
both very strong signals if your signals
conflict with each other what's going to
happen is the system will start falling
back on lesser signals so it'll start
listening to things like sitemaps or
PageRank or the now deceased redirect
to shorter okay so if if you have
conflicting strong signals then
basically you're saying these don't
matter we just don't know how to train
the system in those cases because like
how does a human evaluate that at the
end of the day we can only train the
system as well as a human can evaluate
what the correct answer is we just don't
know once webmasters start giving us
confusing signals like that I I guess
you know you can't train a system to
just sit in a corner and cry because
that's what a human would do in that
case yeah we we train the system to be
ambivalent okay
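Editor's sketch: the fallback behavior described here can be pictured as a weighted vote in which two equally strong but conflicting signals cancel out, so weaker signals end up deciding. This is only an illustration. The signal names and weights are hypothetical, and nothing in the conversation says the real system is a simple sum.

```python
# Hypothetical signal names and weights, chosen only for illustration.
WEIGHTS = {
    "redirect_301": 10,
    "rel_canonical": 10,
    "sitemap": 2,
    "pagerank": 1,
}

def pick_canonical(cluster, votes):
    """Pick a canonical URL from a cluster.

    cluster: list of URLs already judged to be duplicates.
    votes:   dict mapping a signal name to the URL that signal favors.
    """
    scores = {url: 0 for url in cluster}
    for signal, url in votes.items():
        if url in scores:
            scores[url] += WEIGHTS.get(signal, 0)
    # Ties resolve to the first URL in cluster order.
    return max(cluster, key=lambda url: scores[url])

cluster = ["http://example.com/a", "https://example.com/a"]

# Aligned strong signals: the outcome is unambiguous.
aligned = {
    "redirect_301": "https://example.com/a",
    "rel_canonical": "https://example.com/a",
}

# Conflicting strong signals cancel, so the sitemap entry decides.
conflicting = {
    "redirect_301": "https://example.com/a",
    "rel_canonical": "http://example.com/a",
    "sitemap": "http://example.com/a",
}
```

With the conflicting votes, the 301 and the rel canonical contribute equal weight to opposite URLs, which leaves the much weaker sitemap vote to settle the result.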
that all right so we've heard about
redirects we've heard about clustering
and that actually the clustering bit
reminds me of something that keeps
coming up and I think this got a little
worse since Google search console
started primarily reporting only on
canonical URLs and that is when you have
a um a website that is in three regions
that have near duplicates so let's say
Germany and Switzerland both use German
mhm and High German at that in in text uh
in written content and then you have
like a product page and it's pretty much
the exact same information except the
price and the currency and
and website owners make a huge effort of
like making that they tell us so this is
the version for the Swiss market this is
the version for the German market so
like they use hreflang and all these lovely
things that we have and yet one of these
gets chosen as the canonical and shown
in reports and then also this canonical
sometimes changes you know it makes
things uh interesting let's put it that
way how does that work but I I think it
also plays into the clustering bit right
if you tell us that it's kind of the
same but different language versions is
that is that part of a cluster
then this would be the localization
iceberg that we're now encountering you
you you can see the tiny sliver above
the waterline and then there's this
giant mass underneath if this topic
seems confusing externally it's also
confusing internally we we have been
trying to make localization work in a
reasonable way for a very long time um
because it's a very challenging subject
so you're asking about how clustering
works with localization well the answer
is it depends oh people love that
externally people love when you say it
depends yeah so so internally there's
essentially two categories of
localization types there are the the
localization types where it's just a
boilerplate translation which is
something you see very common especially
with big social media sites they they
don't translate the the content whereas
there are also translations that are
full translations where you will see the
actual content of the page fully change
yeah and I mean the boilerplate bit is
pretty pointless
right I mean yes largely speaking it
does not help a lot for people to see
hey this is the you know the Swedish
version of your favorite celebrities
social media feed that that is not
something that we're really concerned
with handing out for people but uh the
the full translation pages should not
cluster because they have different
tokens they're going to retrieve for
different queries so we don't want them
in the same cluster we want to have all
those pages available for retrieval the
boilerplate translations we want to put
into the same cluster and uh and that
means that they'll consolidate signals
but it also means that we don't have to
crawl every single localization variant
because to be honest you know we're
wasting your bandwidth and and we're
wasting our space by doing that so
that's why it depends uh there's there's
two different ways we want to handle
these things and you know what what
which one you're doing matters and and
then you get the really complicated ones
like what you said where they just
change the price and those ones become
more complicated because it's it's
basically the same content but for one
token but that one token really matters
um and in that one token case we still
want to have them in different clusters
that's a more challenging problem in
theory than you know not putting two
language variants in the same cluster
but uh you know
that's why localization is a hard space
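Editor's sketch: one way to see why the price-only case is hard is to compare token overlap between page variants. The example uses Jaccard similarity on whitespace tokens, which is an assumption made for illustration and not a description of how the dupes system measures similarity. The product strings are made up.

```python
def jaccard(a: str, b: str) -> float:
    """Share of distinct lowercase tokens that two texts have in common."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Hypothetical product pages: German market, Swiss market, full English translation.
page_de = "blaue laufschuhe groesse 42 preis EUR89"
page_ch = "blaue laufschuhe groesse 42 preis CHF99"
page_en = "blue running shoes size 42 price EUR89"

similarity_de_ch = jaccard(page_de, page_ch)  # differs by a single token
similarity_de_en = jaccard(page_de, page_en)  # almost every token differs
```

The full translation shares almost no tokens with the German page, so keeping it in a separate cluster is straightforward. The Swiss page differs by one token only, so any overlap-based measure sees it as a near duplicate even though that one token is the part that matters.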
in in the case of boilerplate
translations would we still try to swap
out the URLs when we show them in search
or oh absolutely uh so sitting on top of
all of this talk about clustering which
is the dupes system on its own there's
hreflang which is basically a separate
system where if you put in the
annotations we will try to substitute
them um John knows that uh there is a
project right now which may or may not
be live by the end of the year um that
is attempting to increase the reach of
that
specifically so we want to serve more
hreflang variants we want to utilize that
more but we need to put in place
mechanisms that will determine basically
how much we can trust it on a given site
so we're doing some crawl and
verification basically to determine
you know is this site serving its map
correctly uh and if so then we're going
to try to serve that more often without
necessarily having to verify it as much
as we currently do okay I I guess that
would also work for the Swiss and the
German versions hopefully yes I'm not
super familiar on the specifics between
you know German German and Swiss German
but if there are minor differences then
I would expect this to be able to say oh
you're from Switzerland and there's an
hreflang entry for Swiss German so here you
go this is the right page for you cool
that's pretty nice yeah that that sounds
interesting and uh with x-default do you
find that's something that sites
generally use correctly or it it always
feels tricky to explain that because by
the time you get to that it's like their
head already blows up from all of the
hreflang so um Martin was asking me about
canonicalization signals earlier x-default
is actually a signal and uh not an
inconsequential one I don't know that it
is used very commonly it does seem to be
used reasonably well when it is used uh
I kind of wish people would use it a bit
more to to put this in perspective
you've kind of got two tools here one of
them is rel canonical which says hey I'm
supposed to be clustered with this other
page and that other one is supposed to
be canonical x-default is more of a hey
if you don't know what a locale what
locale to do or or I wind up in the same
cluster as this other page that's the
one you want for retrieval and that sort
of thing it is a sort of rel canonical
in a way but not for clustering just for
canonical selection as long as the
signals align I guess if if you then use
other things like x-default to one thing
but then rel canonical to another thing
is probably confusing signals again no
yes but that's sort of expected in a way
right uh like we have to make
accommodations for that in this specific
case because you could imagine say I
have multiple different versions of this
Swiss page and I also have multiple
versions of this German page and I want
to rel canonical those guys into their
own independent clusters but then I also
want them to be a member of this hreflang
map oh okay oh my God yeah no I this is
a complicated subject which is you know
why you know when I started it's like it
depends this is complicated there's an
iceberg we're we're now starting to
descend now now you can start to feel
the the joy of of dealing with
localization mechanics oh boy do do you
think there will be a simpler variation
to do localization at some point I I
remember like it's at some point Gary
and I sat down we discussed options and
then my simpler solution was to use a
set of regular expressions and then I
realized this is the wrong
direction a a set of regular expressions
and you call that a simpler mechanism
exactly uh yeah this topic has been one
that I've been hearing about since I
joined the company um and and good ideas
in the space have been hard to come by
which is why we're kind of running with
the best we've got right now so you know
you're you're you're rolling your eyes
and you're you're you're you're nodding
your heads and saying oh God this is a
mess and yes it kind of is but we don't
have better Solutions and in the
meantime things have just been a mess
anyway so so why not just run with
something that is at least slightly
better than the status quo is kind of
where I'm hoping to go I mean some of
the more advanced folks that are working
on these kind of international sites
they kind of understand what to watch
out for what to do and for those of you
out there who are wondering what are we
talking about and what is this
internationalization we did discuss this
Gary and I in episode 78 of this podcast
we're going to link that in the
description below as well so that you
can listen up on internationalization
and joy and fun but uh oh boy uh it's an
iceberg I I see that yeah I can see that
but that's not the only thing that you
do in clustering dealing with
localization I guess you have other
fantastic icebergs such as uh error
pages you
mentioned ah okay so so this is can can
I start by by by threatening people with
marauding black holes
[Laughter]
what error pages and clustering have an
unfortunate relationship where
undetected error pages just get a
checksum like any other page would and then
cluster by checksum and so error pages
tend to cluster with each other that
makes sense at this point right
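Editor's sketch: the effect described here can be reproduced with any content checksum. SHA-256 and the example URLs are assumptions made for illustration. The conversation does not say which checksum is used or what else feeds into clustering.

```python
import hashlib
from collections import defaultdict

def checksum(body: str) -> str:
    """Content fingerprint; SHA-256 is an arbitrary choice for this sketch."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def cluster_by_checksum(pages):
    """Group URLs whose bodies are byte-for-byte identical.

    pages: iterable of (url, body) pairs.
    """
    clusters = defaultdict(list)
    for url, body in pages:
        clusters[checksum(body)].append(url)
    return list(clusters.values())

# Hypothetical crawl results: two unrelated URLs returned the same error text.
pages = [
    ("https://example.com/shoes", "Blue running shoes"),
    ("https://example.com/hats", "Something went wrong"),
    ("https://example.com/profile/alice", "Something went wrong"),
]
clusters = cluster_by_checksum(pages)
```

The hats page and the profile page have nothing to do with each other, yet they land in one cluster because the error text is all the checksum sees.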
oh oh is that these cases where you have
like a website that has I don't know
like 20 products that are no longer
available and they have like replaced it
with this item is no longer available
and it's kind of an error page but it
doesn't serve as an error page because
it serves as an HTTP 200 but then the
content is all the same so the
checksums will be all the same and then weird
things happen right so that's a good
example yes that that is exactly what
I'm talking about now in that case the
webmaster might not be too concerned
because these products if they're if
they're permanently gone then they want
them gone so it's not a big deal now if
they're temporarily gone though this is
a problem because now they've all been
sucked into this cluster they're
probably not coming back out cuz crawl
really doesn't like dupes they're like
oh that page is a dupe forget it I never
need to crawl it again um so that's why
it's a black hole only the things that
are very towards the top of the cluster
are likely to get back out um and this
is where this really worries me is uh
sites with transient errors like what
you're describing there is sort of a
like an intentional transient error but
you know let's say that you've got three nines
reliability oh no well one out of every
thousand times you're going to serve us
your error and now you got a marauding
black hole of dead pages and it gets
worse because you're also serving a
bunch of JavaScript dependencies
JavaScript and if those fail to fetch
they might break your render in which
case we'll look at your page and we'll
think it's broken so the actual
reliability of your page after it's gone
through those steps is not necessarily
very high yeah um so we have to worry a
lot about getting these kinds of
marauding black hole clusters from uh
taking over a site because stuff just
gets dumped in them like there were
social media sites where I would look at
the you know the most prominent profiles
and they would just have reams of pages
underneath them some of them fairly
high-profile themselves that just did
not belong in that cluster oh boy okay
yeah I've I've seen something like that
when someone was AB testing a new
version of their website and then
certainly would break with error
messages because the API had changed and
like the the calls no longer worked or
something like that and then in like 10%
of the cases you would get like an error
message for pretty much all of their
content and uh yeah getting back out of
that was tricky I guess yeah I've I've
also seen something that I assume is
similar to this where uh if if a site
has some kind of a CDN in front of it
where the CDN does some kind of bot
detection or dos detection and then oh
yeah serve something like oh it's like it
looks like you're a bot and Googlebot
is yes I'm a bot but then all of those
pages I guess end up being clustered
together and probably across multiple
sites right yes basically Gary uh has
actually been doing some Outreach for us
on this subject you know we we we come
across instances like this and we do try
to get uh providers of these services
to work with us well at least work with
Gary I I don't know what he's what he
does with them he he's in charge of that
but uh not all of them are are as as
Cooperative so uh that's something to be
aware of and and I guess sites would
notice this in search console when when
it says like Google picked a different
canonical and then they look at it and
it's like this is a totally unrelated
page how does Google come up with this
idea yeah that's
this is the kind of thing that's leading
to that yes but what do I do so this
black hole sounds really scary
especially if you say like oh it's
really hard to get out of it again if it
happens for whatever reason or if I'm
launching a new website or a new revamp
of a website or new version of a website
how can I as the SEO on that website
make sure or what what do I need to look
out for to avoid this black hole uh the
easiest way is to serve correct HTTP
codes so you know send us a 404 or a 403
or a 503 and and if you do that you're
not going to Cluster we can only cluster
pages that serve a 200 oh only 200s go
into black
holes okay that's a good statement I I
like that that's a that's a pretty good
one only 200s go into the black
hole the the other option here is um if
you are doing JavaScript Foo in which
case you might not be able to send us an
HTTP code might be a little too late for
that uh what you can do there is you can
attempt to serve us an actual error
message something that is very
discernably an error like you know you
could literally just say you know 503
this we encountered a server error or
403 you were not authorized to view this
or 404 we could not find the correct
file any of those things would work um
you you know you don't even need to use an HTTP
code obviously you could just say
something we do have well we have a
system that's supposed to detect error
pages and we we want to improve its
recall beyond what it currently does to try
to tackle some of these bad renders and
these uh you know bot-served pages type
things but um in the meantime it's it's
generally safest to take things into
your own hands and try to make sure that
Google understands your intent as well
as possible and I I think externally we
call these soft 404 Pages yep okay and
internally we we sometimes call them
crypto 404s yeah that's that's the term
I'm more used to yes okay
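Editor's sketch: the advice to serve correct HTTP codes, with an error message that is clearly an error, might look like this on a small server. The catalog, the `BACKEND_UP` flag and the function names are hypothetical. Only the standard library `http.server` calls are real.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-memory catalog; a real site would query its own backend.
CATALOG = {"42": "Blue running shoes"}
BACKEND_UP = True

def response_for(product_id, catalog, backend_up):
    """Return (status, body) so that an error state never goes out as a 200."""
    if not backend_up:
        return 503, "503 we encountered a server error"
    if product_id not in catalog:
        return 404, "404 we could not find this product"
    return 200, catalog[product_id]

class ProductHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The last path segment is treated as the product id, e.g. /products/42.
        product_id = self.path.rstrip("/").rsplit("/", 1)[-1]
        status, body = response_for(product_id, CATALOG, BACKEND_UP)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, `HTTPServer(("", 8000), ProductHandler).serve_forever()` from the same module serves the handler. The point of the design is that a missing product and a backend outage never share a 200 response with real content.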
uh quick question I usually recommend in
this case so we do have like client-side
rendering or single page applications uh
where we have this problem that you
can't change the HTTP status code but you
could use JavaScript to redirect to a
page that is statically set to return a
404 or 500 or whatever it is would that
also avoid this clustering
issue uh I think so yes uh tler usually
straps those redirects together for us
at indexing time so we would effectively
see your page as the HTTP result at the
end of the chain mhm okay and the other
option we we sometimes tell people is to
use a noindex on a page that basically
says
404 does that make sense I guess if it's
a page that is supposed to be
permanently gone then it would be
clustered with others
so yeah so from my perspective if you
serve us a noindex that's very
different from serving us uh an HTTP
error code if you serve us an HTTP error
code what actually happens is we'll say
oh this page suddenly went error but
maybe it isn't supposed to be so we give
you a bit of a grace period before we
remove you from the index if you serve
us a noindex we're like oh they went
noindex get this out get remove this they
can't we can't serve it so you're gone
okay okay there's a different urgency to
these two things that's interesting yeah
so I I would suggest not necessarily
serving noindexes on error pages uh
unless you really want us to remove that
page if it's permanently an error then
go ahead noindex it all you like um but
if it's temporarily an error no no no no
interesting okay so those are things
where where the content is clearly like
an error has
has malfunctioned in some way and then
we get an error but what about things
where we just make mistakes like what
happens if I accidentally cluster a
bunch of near duplicates into a
canonical situation and then realize oh
no I didn't want that can I undo like if
I fix my rel canonical after things
have been clustered is that another
black hole kind of situation or are you
like oh okay yeah that one signal has
been fixed I kind of want to to punt you
over to the crawl team on this one okay
the the the problem with this is that
it's very much on crawl to decide when
to crawl things and I believe that
webmasters do have some recourse here they
they can request crawl to some extent
I don't know how effective that would
be in these cases because I'm not part
of the team that schedules crawl so I I
can't tell you how much they actually
listen to that feed I think they do
somewhat but it's not a dupes problem
well I mean all of these problems are
related like we actually do send rel
canonical for example is actually a bit
of a crawl signal like we'll try to get
uh crawl to pick up a rel canonical
target if it hasn't been crawled before
so we do talk to them we do communicate
with them for some cases where we're
like hey this is a thing you should look
at um but we don't have any code that
says hey wake up and inspect these dupes
Pages because we don't know unless they
crawl them that their signals have
changed oh of course it's kind of like
if it's blocked by robots.txt like how
can we tell what you changed on your
page we don't know yeah interesting okay
so we we should have a podcast with
someone from the crawl team Martin oh
okay noted noted yes all right uh if you
have any questions to the crawl team
please let us know in the comments we
are really looking forward to hear if
people would like us to talk a little
more about crawling with the crawl team
that that's an interesting one cool but
other things can go wrong as well I mean
we talked about X default and uh
localization being an iceberg I mean I
could imagine accidentally serving some
different language than you actually
specify in the hreflang setup so if I
have like the German version that
accidentally for whatever reasons pulls
data from I don't know the Spanish
version um does that Tinker or collide
with with clustering as well or do you
just go like okay they signal this is
the language version X and we don't care
or how does that work is that a
different team as
well one of the parts of the
localization iceberg is that there are
multiple teams this this is a problem
that crosses the stack um oh boy what
you're describing there to be honest I'm
not sure I completely followed the
example but uh mislabeling your content
is not something that the dupe system
worries too much about in terms of
languages so from my
perspective we probably didn't even
notice that that happened um it would be
might be more interesting to ask that
question to like someone from
serving but I yeah I don't have a good
answer for that all right okay serving
also goes on the list we will find
someone yeah I'm just I'm giving you all
sorts of other people to interview at
this point well this is useful that's
that's fantastic I don't I'm this is
fine this is perfectly
fine luckily you already interviewed Zoe
for rendering so you don't need to worry
about that one that is true and I
actually work with Zoe quite a bit uh
because we have all sorts of interesting
edge cases and problems and I'm pretty
sure there's edge cases for you in
clustering as well what is like a really
interesting Edge case that you
encountered for clustering mhm well okay
so
given the the the likely audience here
the one that's probably most interesting
for them would be when I see people who
put junk into the rel canonical field
so like sometimes it's a script gone
wrong and you can see that oh there was
supposed to be some sort of variable
evaluation that didn't happen so you see
like dollar sign variable name or
something and then so all the rel
canonical on the site are suddenly
pointing to hostname slash variable or in
another case I've seen people just leave
the field empty and uh that has a
meaning oh wait wait wait wait wait wait
wait what does that mean uh so I think
the parser actually turns it into just a
forward slash
oh like it's it's a relative it should
be a relative path but I think I think
it actually goes down to like the root
of the server so uh it's basically the
same as saying please wipe my site out okay
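Editor's sketch: a site owner could catch both kinds of junk mentioned here, the unexpanded template variable and the empty field, before a page ships. The function name and the placeholder patterns are hypothetical. Requiring an absolute http(s) URL is a choice made for this sketch and is stricter than strictly necessary.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns for template variables that were never evaluated,
# such as $variable, ${variable} or {{ variable }}.
PLACEHOLDER = re.compile(r"\$\{?\w+|\{\{")

def canonical_problems(href):
    """Return a list of problems found in a rel canonical href value."""
    value = (href or "").strip()
    if not value:
        return ["empty"]
    problems = []
    if PLACEHOLDER.search(value):
        problems.append("unexpanded template variable")
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("not an absolute http(s) URL")
    return problems
```

Running this over every generated page in a build step and failing on a non-empty result is one way to avoid relying on the validation on the search engine side.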
you have to be really careful we we so I
should be clear here we have some
validation in place to try to break rel
canonical when we think they're wrong
but this is another Iceberg like we have
a we have a very old feature that is
essentially being leaned on to do this
and the new feature that we would like
to use to do it has been in development
for years at this point so are we ever
going to have good rel canonical
validation I don't know but in the
meantime the one we've got is imperfect
and if you make mistakes we'll catch
some of them and we'll let some of them
through I I think the solution is to use
an
llm we just
B like given this HTML header what do
you
think
John I'm really curious what it would
say maybe it would start to cry Martin
yeah sit in the corner and cry that's uh
that's the apt response oh my God okay
that's that's bananas all right so
there's there's a lot going on in dupes
clustering I I think that's that's
really really interesting and I I think
the one takeaway that you can probably
take out of this as a website owner is
make sure that every signal points in
the right direction like if you want one
specific URL to be like showing up in
search results then make sure that we
can understand okay this is this
specific version this is the best
candidate for this cluster of URLs
pointing to the exact same content or near-identical
content I I guess that's that's the
biggest takeaway is that or what would
you say people should take away from it
and also HTTP status codes I think yes
oh yeah yeah so just to follow up there
is actually a fairly authoritative
external list on what uh webmaster
signals we use in canonical selection I
I actually looked it over recently and
it's still basically up to date I think
the one thing that might be missing from
it is x-default is now uh kind of
important but the rest of them like
sitemaps 301 rel canonical they're all there
cool that's that's in our documentation
uh so we should update that maybe it'll
be ready by the time this episode comes
out cool so if you see a documentation
update uh done recently then you know
yes that has
happened awesome that's really really
exciting okay that that was super
interesting Alan thank you so so much
that was really really good you're
welcome and thanks John for being here
with me um I think that's it for our
episode huh yeah thanks a lot I think
next time on search off the record we
will be reflecting about the oh God
about the year in search already the the
end of the year is coming closer huh oh
my gosh wow already okay okay before we
get stressed about the fact that the
year is ending I'd like to say again
thanks Alan for being here thanks John
for being here and uh thanks everyone
out there for listening in with that I'd
like to say
goodbye bye
bye we've been having fun with these
podcast episodes I hope you The Listener
have found them both entertaining and
insightful too feel free to drop us a
note on LinkedIn or chat with us at one
of our next events we go to if you have
any thoughts let us know and of course
do not forget to like and subscribe
thank you so much for listening and
goodbye
[Music]