Handling Dupes - Same Same or Different? | Search Off the Record
2024-12-05 · en automatic
[Music]

Hello and welcome to another episode of Search Off the Record, a podcast coming to you from the Google Search team, discussing all things search and having some fun along the way. My name is Martin, and I'm joined today by John from the Search Relations team, of which I'm also part. Hi, John.

Hi, Martin.

And we have a special guest, Alan Scott from the dups team. Hi, Alan.

Dups, dups, dups... Internally we call it dupes, but okay.

Oh, I'm not a native English speaker. For me it's dups.

Okay, so you're actually right. We spell it wrong. We put "dups", and so everyone outside should think it's dups, but for some reason we always called it dupes.

But I think externally we call it canonicalization, which is even worse.

Yeah, it's fantastic, isn't it? I've been fighting that terminology for years.

Oh, really? Okay, before we get into that, would you be so kind to introduce yourself to our audience?

Of course, of course. So my name is Alan Scott. I'm a software engineer at Google. I've been here over 12 years now, I think, and I have spent almost all that time working on the problem of duplicate detection and elimination, which wraps into other friend problems like signal forwarding, and these days even starts pulling in other, wilder topics from the fringes, like error pages and localization. So, yeah.

Oh wow. All right, so we've started this off with me mispronouncing dupes and you telling us that when externally we talk about canonicalization, you really don't like that. Why don't you like the term canonicalization?

To be fair, that's something that I go up against usually more internally, because when people think canonicalization, they sort of imagine this one black box that does all the magic things together, and it's very difficult to handle requests from people that are like, "Well, why is canonicalization wrong?" And so I tend to push people to think of it as, well, canonicalization is one step. It's: I have a bunch of URLs and I want
to know which of them is the canonical. But there are other steps that are as important, if not more important, here, the first one being clustering. Usually when people come to us and complain about canonicalization, the immediate thing we say is, "Oh, that's a clustering problem, because these two pages shouldn't be in the same cluster," let alone cases of canonical selection. If you want to bring a canonicalization problem to me, what that is, is: these two pages are in the same cluster, but we picked the wrong one. The most dire case being a hijacking. We see those and we act really fast, because those are just disasters.

So clustering is basically taking the pages that we think are the same, and then canonicalization is, from those pages, which one is the best one. Is that about right?

Exactly, yes.

Okay.

Yeah, so for example, rel canonical is a bit of a magic factor that crosses both these lines. Rel canonical will first try to put two pages in the same cluster. It may or may not succeed, but if two pages are in the same cluster and there is a rel canonical between them, then it's also a canonical selection signal.

Oh, so you say it's a canonical selection signal. Does that mean that there are other things that could be a signal for canonicalization?

I'm not sure what the exact number is right now, because it goes up and down, but I suspect it's somewhere in the neighborhood of 40.

Whoa, okay. Well, now our listeners will be making spreadsheets with 40 signals, like they used to do with those 200 ranking signals that we had. But I think, if I remember correctly, HTTP versus HTTPS is one of them?

Yes. There are actually multiple criteria that try to deal with that dimension specifically, because we want to get that right, but it's not as easy as it might seem. The general guiding principle we have is that we want "what you see is what you get" for the end user, where if we give them an HTTPS page, then it should actually be secure,
whereas if we don't think it's secure, they should get an HTTP page. That means that sometimes we follow the webmaster signals and sometimes we don't, because webmasters might do things like, "Hey, my HTTPS page redirects to my HTTP page and then to a different HTTPS page that's not secure." So that will get you pushed to an HTTP canonical if we can manage it.

Interesting. I guess the issue of multiple steps of redirects is challenging in general, right? It's like finding which one is the right one to show, or which one is maybe something tied to personalization or the location of the user.

It's funny, actually, this all kind of links together here, because we just came off HTTP versus HTTPS and now we're talking redirects. Just recently I made an effort to sack one of the criteria, and I'll give away the name: it was called "redirect to shorter", and it had a really bad interaction with HTTP/HTTPS, because if you had conflicting signals come from the webmaster, this one would push you to HTTP. So we wanted to get rid of it for the longest time.

Oh, that extra letter?

Yeah, literally just that extra letter. This is why I'm like, go ahead, make your spreadsheets. Some of these criteria are not very smart. Some of them are very tricky, but some of them are also very, very basic heuristics.

Oh my. Okay, wow. But why do we even need like 40, plus or minus X, signals? I mean, website owners never make mistakes and give you the correct canonical all the time, right?

So when it comes to trying to figure out how to weight things, one of our biggest problems is we don't know what to do when a webmaster sends us conflicting signals. The two most common that come up there would be 301 versus rel canonical. Those are both very strong signals. If your signals conflict with each other, what's going to happen is the system will start falling back on lesser signals, so it'll start listening to things like sitemaps or PageRank or the now-deceased "redirect to shorter".

Okay, so if
you have conflicting strong signals, then basically you're saying these don't matter?

We just don't know how to train the system in those cases, because how does a human evaluate that? At the end of the day, we can only train the system as well as a human can evaluate what the correct answer is. We just don't know, once webmasters start giving us confusing signals like that.

I guess, you know, you can't train a system to just sit in a corner and cry, because that's what a human would do in that case.

Yeah, we train the system to be ambivalent.

Okay. All right, so we've heard about redirects, we've heard about clustering, and actually the clustering bit reminds me of something that keeps coming up, and I think this got a little worse since Google Search Console started primarily reporting only on canonical URLs. And that is when you have a website that is in three regions that have near duplicates. So let's say Germany and Switzerland both use German...

Mhm.

...and High German at that, in written content. And then you have a product page, and it's pretty much the exact same information except the price and the currency. And website owners make a huge effort of telling us, "So this is the version for the Swiss market, this is the version for the German market." They use hreflang and all these lovely things that we have, and yet one of these gets chosen as the canonical and shown in reports, and then this canonical also sometimes changes. You know, it makes things interesting, let's put it that way. How does that work? I think it also plays into the clustering bit, right? If you tell us that it's kind of the same but different language versions, is that part of a cluster?

Then this would be the localization iceberg that we're now encountering. You can see the tiny sliver above the waterline, and then there's this giant mass underneath. If this topic seems confusing externally, it's also confusing internally. We have been
trying to make localization work in a reasonable way for a very long time, because it's a very challenging subject. So you're asking about how clustering works with localization. Well, the answer is: it depends.

Oh, people love that externally. People love when you say "it depends".

Yeah. So internally there are essentially two categories of localization types. There are the localization types where it's just a boilerplate translation, which is something you see very commonly, especially with big social media sites. They don't translate the content. Whereas there are also translations that are full translations, where you will see the actual content of the page fully change.

Yeah, and I mean, the boilerplate bit is pretty pointless, right?

I mean, yes. Largely speaking, it does not help a lot for people to see, "Hey, this is the Swedish version of your favorite celebrity's social media feed." That is not something that we're really concerned with doling out for people. But the full-translation pages should not cluster, because they have different tokens. They're going to retrieve for different queries, so we don't want them in the same cluster. We want to have all those pages available for retrieval. The boilerplate translations we want to put into the same cluster, and that means that they'll consolidate signals, but it also means that we don't have to crawl every single localization variant, because, to be honest, we're wasting your bandwidth and we're wasting our space by doing that. So that's why it depends. There are two different ways we want to handle these things, and which one you're doing matters. And then you get the really complicated ones, like what you said, where they just change the price. Those become more complicated because it's basically the same content but for one token, but that one token really matters. And in that one-token case, we still want to have them in different clusters.
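
The "same content but for one token" case can be made concrete with a toy comparison. This is only an illustrative sketch, not a description of how Google's dupes system works: the tokenizer, the `differing_tokens` and `jaccard` helpers, and the sample product copy are all assumptions invented for this example.

```python
import re


def tokens(text):
    """Split text into a set of lowercase word tokens (naive tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def differing_tokens(page_a, page_b):
    """Tokens that appear on exactly one of the two pages."""
    return tokens(page_a) ^ tokens(page_b)


def jaccard(page_a, page_b):
    """Token-set overlap between 0.0 (disjoint) and 1.0 (identical)."""
    a, b = tokens(page_a), tokens(page_b)
    return len(a & b) / len(a | b)


# Hypothetical product copy for three locale variants.
german = "Wanderschuh Modell X wasserdicht und leicht Preis 120 EUR"
swiss = "Wanderschuh Modell X wasserdicht und leicht Preis 135 CHF"
swedish = "Vandringssko modell X vattentät och lätt pris 1300 SEK"

# The German and Swiss pages differ only in the price and currency tokens.
print(sorted(differing_tokens(german, swiss)))

# A full translation shares far fewer tokens than a price-only variant.
print(jaccard(german, swiss), jaccard(german, swedish))
```

A similarity score alone would call the German and Swiss pages near duplicates, which is why the price-only case is harder than keeping two full translations apart: the few tokens that differ are exactly the ones that matter to the user.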
That's a more challenging problem in theory than, you know, not putting two language variants in the same cluster, but that's why localization is a hard space.

In the case of boilerplate translations, would we still try to swap out the URLs when we show them in search, or...?

Oh, absolutely. So sitting on top of all of this talk about clustering, which is the dupes system on its own, there's hreflang, which is basically a separate system where, if you put in the annotations, we will try to substitute them. John knows that there is a project right now, which may or may not be live by the end of the year, that is attempting to increase the reach of that specifically. So we want to serve more hreflang variants, we want to utilize that more, but we need to put in place mechanisms that will determine, basically, how much we can trust it on a given site. So we're doing some crawl and verification, basically, to determine, you know, is this site serving its map correctly? And if so, then we're going to try to serve that more often without necessarily having to verify it as much as we currently do.

Okay. I guess that would also work for the Swiss and the German versions?

Hopefully, yes. I'm not super familiar with the specifics between, you know, German German and Swiss German, but if there are minor differences, then I would expect this to be able to say, "Oh, you're from Switzerland, and there's an hreflang entry for Swiss German, so here you go, this is the right page for you."

Cool, that's pretty nice.

Yeah, that sounds interesting. And with x-default, do you find that's something that sites generally use correctly? It always feels tricky to explain that, because by the time you get to that, their head has already blown up from all of the hreflang.

So, Martin was asking me about canonicalization signals earlier. x-default is actually a signal, and not an inconsequential one. I don't know that it is used very commonly. It does seem to be used reasonably well when it is used. I
kind of wish people would use it a bit more. To put this in perspective, you've kind of got two tools here. One of them is rel canonical, which says, "Hey, I'm supposed to be clustered with this other page, and that other one is supposed to be canonical." x-default is more of a, "Hey, if you don't know what locale to use, or I wind up in the same cluster as this other page, that's the one you want for retrieval," and that sort of thing. It is a sort of rel canonical in a way, but not for clustering, just for canonical selection.

As long as the signals align, I guess. If you then use other things, like x-default to one thing but then rel canonical to another thing, that is probably confusing signals again, no?

Yes, but that's sort of expected in a way, right? We have to make accommodations for that in this specific case, because you could imagine, say, I have multiple different versions of this Swiss page and I also have multiple versions of this German page, and I want to rel canonical those guys into their own independent clusters, but then I also want them to be a member of this hreflang map.

Oh, okay. Oh my God.

Yeah, no, this is a complicated subject, which is, you know, why when I started it's like: it depends, this is complicated, there's an iceberg. We're now starting to descend. Now you can start to feel the joy of dealing with localization mechanics.

Oh boy. Do you think there will be a simpler variation to do localization at some point? I remember at some point Gary and I sat down, we discussed options, and then my simpler solution was to use a set of regular expressions, and then I realized this is not the right direction.

A set of regular expressions, and you call that a simpler mechanism?

Exactly.

Yeah, this topic has been one that I've been hearing about since I joined the company, and good ideas in the space have been hard to come by, which is why we're kind of running with the best we've got right now. So, you know, you're
rolling your eyes and you're nodding your heads and saying, "Oh God, this is a mess," and yes, it kind of is, but we don't have better solutions, and in the meantime things have just been a mess anyway. So why not just run with something that is at least slightly better than the status quo, is kind of where I'm hoping to go.

I mean, some of the more advanced folks that are working on these kinds of international sites, they kind of understand what to watch out for, what to do. And for those of you out there who are wondering what we are talking about and what this internationalization is, we did discuss this, Gary and I, in episode 78 of this podcast. We're going to link that in the description below as well, so that you can listen up on internationalization and joy and fun. But oh boy, it's an iceberg. I can see that. But that's not the only thing that you do in clustering, dealing with localization. I guess you have other fantastic icebergs, such as error pages, you mentioned.

Ah, okay. So can I start by threatening people with marauding black holes?

[Laughter]

What?

Error pages and clustering have an unfortunate relationship, where undetected error pages just get a checksum like any other page would, and then cluster by checksum, and so error pages tend to cluster with each other. That makes sense at this point, right?

Oh, is that these cases where you have a website that has, I don't know, 20 products that are no longer available, and they have replaced it with "this item is no longer available", and it's kind of an error page, but it doesn't serve as an error page because it serves an HTTP 200, but then the content is all the same, so the checksums will be all the same, and then weird things happen, right?

So that's a good example, yes. That is exactly what I'm talking about. Now, in that case the webmaster might not be too concerned, because if these products are permanently gone, then they want them gone, so
it's not a big deal. Now, if they're temporarily gone, though, this is a problem, because now they've all been sucked into this cluster. They're probably not coming back out, because crawl really doesn't like dupes. They're like, "Oh, that page is a dupe, forget it, I never need to crawl it again." So that's why it's a black hole. Only the things that are very much towards the top of the cluster are likely to get back out. And where this really worries me is sites with transient errors. What you're describing there is sort of an intentional transient error, but let's say that you've got three nines of reliability...

Oh no.

Well, one out of every thousand times you're going to serve us your error, and now you've got a marauding black hole of dead pages. And it gets worse, because you're also serving a bunch of JavaScript dependencies, and if those fail to fetch, they might break your render, in which case we'll look at your page and we'll think it's broken. So the actual reliability of your page after it's gone through those steps is not necessarily very high.

Yeah.

So we have to worry a lot about keeping these kinds of marauding black hole clusters from taking over a site, because stuff just gets dumped in them. There were social media sites where I would look at the most prominent profiles, and they would just have reams of pages underneath them, some of them fairly high-profile themselves, that just did not belong in that cluster.

Oh boy, okay. Yeah, I've seen something like that when someone was A/B testing a new version of their website, and then it would break with error messages because the API had changed and the calls no longer worked or something like that. And then in like 10% of the cases you would get an error message for pretty much all of their content, and yeah, getting back out of that was tricky.

I guess, yeah. I've also seen something that I assume is similar to this, where a site has some kind of a CDN in
front of it, where the CDN does some kind of bot detection or DoS detection, and then...

Oh yeah.

...serves something like, "Oh, it looks like you're a bot," and Googlebot is like, "Yes, I'm a bot." But then all of those pages, I guess, end up being clustered together, and probably across multiple sites, right?

Yes, basically. Gary has actually been doing some outreach for us on this subject. We come across instances like this, and we do try to get providers of these services to work with us. Well, at least work with Gary. I don't know what he does with them. He's in charge of that. But not all of them are as cooperative, so that's something to be aware of.

And I guess sites would notice this in Search Console when it says, like, "Google picked a different canonical", and then they look at it and it's like, "This is a totally unrelated page. How does Google come up with this idea?"

Yeah, this is the kind of thing that's leading to that, yes.

But what do I do? So this black hole sounds really scary, especially if you say it's really hard to get out of it again if it happens, for whatever reason. Or if I'm launching a new website or a revamp of a website or a new version of a website, how can I, as the SEO on that website, make sure... or what do I need to look out for to avoid this black hole?

The easiest way is to serve correct HTTP codes. So, you know, send us a 404 or a 403 or a 503, and if you do that, you're not going to cluster. We can only cluster pages that serve a 200.

Oh, only 200s go into black holes. Okay, that's a good statement. I like that. That's a pretty good one: only 200s go into the black hole.

The other option here is, if you are doing JavaScript foo, in which case you might not be able to send us an HTTP code, it might be a little too late for that, what you can do there is you can attempt to serve us an actual error message, something that is very discernibly an error. You could literally just say,
you know, "503, we encountered a server error", or "403, you were not authorized to view this", or "404, we could not find the correct file". Any of those things would work. You don't even need to use an HTTP code, obviously. You could just say something. We have a system that's supposed to detect error pages, and we want to improve its recall beyond what it currently does, to try to tackle some of these bad renders and these bot-served-pages type things. But in the meantime, it's generally safest to take things into your own hands and try to make sure that Google understands your intent as well as possible.

And I think externally we call these soft 404 pages?

Yep.

Okay. And internally we sometimes call them crypto 404s.

Yeah, that's the term I'm more used to, yes.

Okay. Quick question. What I usually recommend in this case... so we do have client-side rendering or single-page applications, where we have this problem that you can't change the HTTP status code, but you could use JavaScript to redirect to a page that is statically set to return a 404 or 500 or whatever it is. Would that also avoid this clustering issue?

I think so, yes. [unclear] usually straps those redirects together for us at indexing time, so we would effectively see your page as the HTTP result at the end of the chain.

Mhm, okay. And the other option we sometimes tell people is to use a noindex on a page that basically says 404. Does that make sense? I guess if it's a page that is supposed to be permanently gone, then it would be clustered with others, so...

Yeah, so from my perspective, if you serve us a noindex, that's very different from serving us an HTTP error code. If you serve us an HTTP error code, what actually happens is we'll say, "Oh, this page suddenly went error, but maybe it isn't supposed to be," so we give you a bit of a grace period before we remove you from the index. If you serve us a noindex, we're like, "Oh, they went noindex. Get this out, remove this. We can't
serve it, so you're gone."

Okay. There's a different urgency to these two things. That's interesting.

Yeah, so I would suggest not necessarily serving noindexes on error pages, unless you really want us to remove that page. If it's permanently an error, then go ahead, noindex it all you like. But if it's temporarily an error, no, no, no, no.

Interesting. Okay, so those are things where the content has clearly malfunctioned in some way, and then we get an error. But what about things where we just make mistakes? Like, what happens if I accidentally cluster a bunch of near duplicates into a canonical situation and then realize, "Oh no, I didn't want that"? Can I undo? If I fix my rel canonical after things have been clustered, is that another black hole kind of situation, or are you like, "Oh, okay, yeah, that one signal has been fixed"?

I kind of want to punt you over to the crawl team on this one.

Okay.

The problem with this is that it's very much on crawl to decide when to crawl things, and I believe that webmasters do have some recourse here. They can request crawl to some extent, and I don't know how effective that would be in these cases, because I'm not part of the team that schedules crawl, so I can't tell you how much they actually listen to that feed. I think they do somewhat, but it's not a dupes problem. Well, I mean, all of these problems are related. Rel canonical, for example, is actually a bit of a crawl signal: we'll try to get crawl to pick up a rel canonical target if it hasn't been crawled before. So we do talk to them, we do communicate with them for some cases where we're like, "Hey, this is a thing you should look at." But we don't have any code that says, "Hey, wake up and inspect these dupes pages," because we don't know, unless they crawl them, that their signals have changed.

Oh, of course. It's kind of like if it's blocked by robots.txt: how can we tell what you changed on your page? We don't know.

Yeah.
Interesting. Okay, so we should have a podcast with someone from the crawl team, Martin.

Oh, okay. Noted, noted, yes. All right, if you have any questions for the crawl team, please let us know in the comments. We are really looking forward to hearing if people would like us to talk a little more about crawling with the crawl team. That's an interesting one. Cool. But other things can go wrong as well. I mean, we talked about x-default and localization being an iceberg. I could imagine accidentally serving some different language than you actually specify in the hreflang setup. So if I have the German version that accidentally, for whatever reason, pulls data from, I don't know, the Spanish version, does that tinker or collide with clustering as well? Or do you just go like, "Okay, they signal this is language version X and we don't care"? Or how does that work? Is that a different team as well?

One of the parts of the localization iceberg is that there are multiple teams. This is a problem that crosses the stack.

Oh boy.

What you're describing there... to be honest, I'm not sure I completely followed the example, but mislabeling your content is not something that the dupes system worries too much about in terms of languages. So from my perspective, we probably didn't even notice that that happened. It might be more interesting to ask that question to someone from serving, but yeah, I don't have a good answer for that.

All right, okay. Serving also goes on the list. We will find someone.

Yeah, I'm just giving you all sorts of other people to interview at this point.

Well, this is useful. That's fantastic. This is fine, this is perfectly fine.

Luckily you already interviewed Zoe for rendering, so you don't need to worry about that one.

That is true, and I actually work with Zoe quite a bit, because we have all sorts of interesting edge cases and problems. And I'm pretty sure there are edge cases for you in clustering as well. What is like a
really interesting edge case that you encountered for clustering?

Mhm. Well, okay, so given the likely audience here, the one that's probably most interesting for them would be when I see people who put junk into the rel canonical field. Sometimes it's a script gone wrong, and you can see that, oh, there was supposed to be some sort of variable evaluation that didn't happen, so you see a dollar sign and a variable name or something, and so all the rel canonicals on the site are suddenly pointing to hostname, slash, variable. Or, in another case, I've seen people just leave the field empty, and that has a meaning.

Oh, wait, wait, wait, wait. What does that mean?

So I think the parser actually turns it into just a forward slash. It should be a relative path, but I think it actually goes down to the root of the server. So it's basically the same as saying, "Please wipe my site out."

Okay. You have to be really care...

So I should be clear here. We have some validation in place to try to break rel canonicals when we think they're wrong, but this is another iceberg. We have a very old feature that is essentially being leaned on to do this, and the new feature that we would like to use to do it has been in development for years at this point. So are we ever going to have good rel canonical validation? I don't know. But in the meantime, the one we've got is imperfect, and if you make mistakes, we'll catch some of them and we'll let some of them through.

I think the solution is to use an LLM. We just go like, "Given this HTML header, what do you think?" John?

I'm really curious what it would say. Maybe it would start to cry, Martin.

Yeah, sit in the corner and cry. That's the apt response. Oh my God. Okay, that's bananas. All right, so there's a lot going on in dupes clustering. I think that's really, really interesting, and I think the one takeaway that you can probably take out of this as a website owner
is: make sure that every signal points in the right direction. If you want one specific URL to be showing up in search results, then make sure that we can understand, okay, this specific version is the best candidate for this cluster of URLs pointing to the exact same content or near-identical content. I guess that's the biggest takeaway. Is that it, or what would you say people should take away from it?

And also HTTP status codes, I think.

Yes. Oh yeah. So just to follow up, there is actually a fairly authoritative external list on what webmaster signals we use in canonical selection. I actually looked it over recently, and it's still basically up to date. I think the one thing that might be missing from it is that hreflang x-default is now kind of important, but the rest of them, like sitemaps, 301, rel canonical, they're all there.

Cool. That's in our documentation, so we should update that. Maybe it'll be ready by the time this episode comes out.

Cool. So if you see a documentation update done recently, then you know, yes, that has happened. Awesome, that's really, really exciting. Okay, that was super interesting. Alan, thank you so, so much. That was really, really good.

You're welcome.

And thanks, John, for being here with me. I think that's it for our episode, huh?

Yeah, thanks a lot.

I think next time on Search Off the Record we will be reflecting on... oh God, on the year in search already. The end of the year is coming closer, huh?

Oh my gosh. Wow, already.

Okay, okay. Before we get stressed about the fact that the year is ending, I'd like to say again: thanks, Alan, for being here, thanks, John, for being here, and thanks, everyone out there, for listening in. With that, I'd like to say goodbye.

Bye bye.

We've been having fun with these podcast episodes. I hope you, the listener, have found them both entertaining and insightful too. Feel free to drop us a note on LinkedIn or chat with us at one of the next events we go to. If you have any thoughts, let us know. And of course, do
not forget to like and subscribe. Thank you so much for listening, and goodbye.

[Music]
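
The "marauding black hole" from the episode, where error pages served with a 200 end up clustered by checksum, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Google's implementation: the `Fetch` record, the `cluster_by_checksum` function, the use of SHA-256 as the checksum, and the example URLs are all hypothetical. The only parts taken from the conversation are that undetected error pages cluster by checksum and that only pages serving a 200 can be clustered.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Fetch:
    """One fetched page (hypothetical record type for this sketch)."""
    url: str
    status: int
    body: str


def cluster_by_checksum(fetches):
    """Group URLs whose bodies hash to the same checksum.

    Pages that did not return HTTP 200 are skipped, mirroring the
    statement in the episode that only 200s can be clustered.
    """
    clusters = defaultdict(list)
    for fetch in fetches:
        if fetch.status != 200:
            continue
        checksum = hashlib.sha256(fetch.body.encode("utf-8")).hexdigest()
        clusters[checksum].append(fetch.url)
    # Keep only groups with more than one member (actual duplicates).
    return [urls for urls in clusters.values() if len(urls) > 1]


GONE = "This item is no longer available."

# Soft 404s: the "gone" message is served with a 200 status.
soft_404_site = [
    Fetch("https://example.com/p/1", 200, GONE),
    Fetch("https://example.com/p/2", 200, GONE),
    Fetch("https://example.com/p/3", 200, "Blue widget, in stock."),
]

# Same pages, but the "gone" message is served with a real 404.
hard_404_site = [
    Fetch("https://example.com/p/1", 404, GONE),
    Fetch("https://example.com/p/2", 404, GONE),
    Fetch("https://example.com/p/3", 200, "Blue widget, in stock."),
]

print(cluster_by_checksum(soft_404_site))
print(cluster_by_checksum(hard_404_site))
```

With the soft 404s, the two unavailable products collapse into one duplicate cluster. With real 404s, nothing clusters, which is the practical point behind "only 200s go into the black hole".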