Internationalization & hreflang | Search Off the Record
2024-07-25 · en automatic
[Music]

Hello and welcome to another episode of Search Off the Record, a podcast coming to you from the Google Search team, discussing all things Search and having some fun along the way. My name is Martin, and I'm joined today by Lizzi and Gary from the Search Relations team, of which I'm also a part. Hi Lizzi! Hi Gary!

How dare you. Okay.

How can you just leave him hanging like this? You're supposed to say hi back.

Yes. "Hi back."

I mean, he's following instructions, so we're having a literal day.

Yeah, we are having a literal day. Speaking of literals: the hreflang implementation.

Ah. Why?

You know, I have questions. I have so many questions. [unclear]

And why now? What did I do? Did I do something?

You did indeed do something. You changed the documentation recently, no?

I keep doing that, yes.

For hreflang, I mean.

Oh, maybe.

And you threw me under the bus: I am on the hook for an hreflang, or multinational sites, Q&A situation soonish.

Okay, I don't know where you saw the bus, because it was me sending an email, "Does anyone want to go?", and then you were like, "Yeah, I will jump on that bus."

I think he just launched himself in front of the bus, like, the bus was coming.

Okay, but I do have a question.

Just one?

No, I actually have a bunch of questions. So the thing that I know about it is: I have in the past implemented hreflang on a small site, and I know how that works, and I know what our guidelines are for these. But everyone keeps telling me it's very complicated, and it's so easy to make a mistake, and it's so complex, and I'm not seeing this complexity. So I'm wondering, where does that come from? Is it because it was a small site?

Probably. I don't know, I think it comes when your site is big or you are managing multiple properties, and the multiple properties each have their own URL structures. Because ideally, well, in a utopian view, you could have example.com/ch, no, example.com/de-ch/ and your pages, and then for
English you would just have en-us, and it would work. It would be so simple. But then let's say that in another jurisdiction you need to have a separate domain, like you need to have example.es, for example. I don't know what you are selling, but maybe you need a different domain name, and then maybe on the different domain name you can't follow that pattern anymore. So you absolutely need to be able to annotate everything separately, every language and country variation. And as soon as you have to vary things, that's where you start introducing errors.

True, because then you can't just apply it by a pattern.

Yeah.

Okay, fair.

And then if you have, let's say, many domain names for many regions, and each has to follow a different publishing schematic or whatever, that's where your errors can come into play.

What do you mean by publishing schematic?

I don't know, like URL structure or something like that.

So it's harder to sync them up, because maybe you're localizing the URL or something, and then you don't know that these things match, and you make typos. For example, with us, on, what is it, developers.google.com/search, there we don't localize the URLs.

Yeah. Well, no, we don't localize the URLs. We have a parameter that changes the language of the page, the hl parameter, but the words in the URLs are not changing.

They're not changing. But we inherited some non-English URLs when we migrated the blog, and we had a hell of a time with that.

Yes.

Ah, so that's where complexity can also be introduced: by just inheriting a grown structure, and then some of it is like this and some of it is like that.

Yeah. So that's what I was wondering when you said that you implemented hreflang for the small site: were you in charge of the whole plan, or were you dealing with already localized content and then had to implement something?

Greenfield project. That, I think, made things a lot easier. We started
from scratch, and we knew that we would have multiple language versions from the get-go.

Okay, so you can set up the plan and make sure that it makes sense from your business strategy, for example.

And I wasn't involved in all the discussions, but we had language subfolders, or subdirectories. So it was like /[unclear]/ and then whatever, and then /de/ whatever. I don't know why we chose that. I know that there are pros and cons to that versus subdomains or just completely different top-level domains. I have no idea why we chose that. It doesn't really matter, does it? I mean, from a Search perspective?

It doesn't matter, no.

Really? Not even, like, the .ch? That doesn't matter?

No. Subdomain doesn't matter.

Like if you have [unclear].de...

But that's a ccTLD.

Yeah, but that matters. That would matter, yeah.

I think, and now we switch more into the internationalization topic and we are not on hreflang anymore, I think eventually, in years' time, that will also fade away, that benefit.

Oh.

Because it's not as reliable anymore as it was back in 2000.

That's true.

Reliable in what sense?

That it could accurately tell you that this domain is related to the country Switzerland.

Oh.

Just think about all the funny domain names that you can buy nowadays, like the .ai. I think that's Antigua or something, the ccTLD for Antigua. It doesn't say anything anymore about the country. It doesn't mean that the content is for the country.

I see. Because people are running out of spaces at the dot level, so they're going...

I don't think they are running out of spaces. It just can be used to be more creative with the language.

I was about to say, I used to have the domain 50linesofco.de, because it's a coding blog, so it reads out as "code": 50 lines of code. It had nothing to do with Germany. I didn't live in Germany, and it was not in German.
Did that cause any problems?

I don't think so. I got good traffic on it.

Well, I don't think that it causes problems necessarily, because one of the main algorithms that do the whole localization thing is called LDCP: language demotion, country promotion. So basically, if you have a .de, then for users in Germany you would get a slight boost with your .de domain name. But nowadays, with whatever "co.de", a .de which doesn't relate to Germany anymore, it doesn't really make sense for us to automatically apply that little boost, because it's ambiguous what the target is.

Okay. Because it's the topic of the site in combination with the language that Martin is writing in?

The language is absolutely not a tell for what country you are targeting.

That would have been my next question: the blog clearly was English, and it was a .de domain, so I didn't get the boost?

Huh, okay, interesting.

So I think from a marketing perspective there's still some value in buying the ccTLDs. And if I were to run some spam network, sorry, a new business, then I would try to buy the country TLDs when I can, when it's monetarily feasible. But I would not worry too much about it.

Okay. And with things that are not ccTLDs: let's say I have something.com/de, and most of the content in there is German, and then there are a few pages in English. Is that something that Search would not like, or is that, like, whatever?

Not like?

Be confused by. So what would the effect be of that, if I for some reason have English content in /de?

No, I'm still stuck on you anthropomorphizing Google. But Google is cute.

Would it cause problems?

I don't think so. I can't come up with a scenario where it would cause problems, because we can go down to page level when we are trying to promote something for a user in a specific country. It doesn't have to be the site.

Okay.

It can be the site, but it doesn't necessarily have to be.

Okay.

Like,
we could know, or we could have learned over the years, that Martin's co.de is a global site.

Mhm.

But then we see that some specific page or section of the site is specifically targeting people living in Switzerland. No, in... oh, now I want to move away from global, so back to Germany.

Okay.

In Berlin or Frankfurt.

Okay.

And then we could say that those pages should get a boost in Germany.

Aha.

Even though they're only targeting one specific region of Germany, we look at it from a country's perspective.

Okay. And when you say "targeting", is it the content on the page, or another part that is showing?

Let's go with content on the page.

Content on the page, okay.

Okay, I have another question, and this actually does come back to hreflang again. So say I have an hreflang annotation on the English version of a page, and it says "this is the German version of this page", but then for some reason I made a mistake and the actual content on that page isn't German. Is that going to cause problems or not?

Not from an hreflang perspective, as far as I remember. I worked on the parsing implementation plus the promotion implementation of hreflang, and back then it didn't cause problems.

Okay.

But that was like 2016 or something.

Okay, that's a few years back.

Since then we changed so many things that I would have to check whether it causes problems. I think it wouldn't.

Okay.

Because when I spelled out LDCP, I said "language demotion, country promotion". So for example, if someone is searching in German and your page is in English, then you would get a negative, the demotion, in the search. It's less relevant to the query, to the person. Unless you are searching for something like "how do you spell banana".

Right. Okay, fair enough.

Because then it doesn't really matter what kind of... well, no, it does, it still matters. But because you're searching for something in English, we would think, okay,
you want some page that explains how to spell banana in English, not German. But then we also know that German is fine, from previous queries, for example.

Oh, huh.

I mean, for personalization we do use these kinds of things.

Oh, so if Martin happens to write, in English, "how do I spell banana", but they know that he also likes German content?

Yesterday, when we were looking for something on our site and we kept getting the German version: when I went home, I looked into it, and it was personalization, basically, because right before that I was searching in German.

Oh, okay, interesting.

So that can also happen. But if you break away from personalization, then generally we will try to push down languages that are not in the current hl language, the hl language being the interface language that you are searching in.

Oh, I see, like google.com itself using the German UI phrases. And going back to more developer-like questions: there's the lang attribute...

Sure.

...on HTML tags.

Sure.

Do we care about that? Okay, I had a feeling that might be unreliable.

For most of the pages it is, and that's because, if you think back to the earlier days, I remember that, I think, Joomla just came with the lang attribute set to English, and there was no way to override it, even if you changed the template to German or whatever. So it was basically yelling, "Hey, I'm English, I'm English, I'm English", and then you looked at the page and it was 100% German. So it became not a reliable thing to look at.

So therefore it doesn't matter if you get it wrong or right, we're not looking at it, right?

Correct. And I don't think that we would want to change that either. If John were here, he would probably argue against what I'm going to say, but ultimately I would want fewer and fewer site annotations and more automatically learned things.

Okay.

Because it is very likely more reliable.

I'm on the fence about that,
because if it goes wrong, then I don't have the means to self-correct that.

I think we should have overrides.

Okay.

But we should be able to, for example, learn hreflang automatically.

Huh.

Entirely. Even when I worked on hreflang, we already had something that was automatically learning...

Mhm.

...that two pages are the same, or different versions of the same content.

Hm.

We could already do that, and this was, what, almost 10 years ago. Wow, I'm old. Now, with the advancements that we have with AI and all that weirdo stuff...

Wow, you're so excited about it.

I'm loving it. I think we are overusing that term, but that's a podcast for another day.

What you need is a regex.

Yes.

Is regex the new AI?

Regex, the new AI. Anyway, if almost 10 years ago we could already do that really quite reliably, then why would we not be able to do it now?

Okay. Is it for the fail-safe thing? Like, we don't have another option for you to do that override, for people to say that "this is what I want"?

You know how sometimes we lose interest in things?

Oh, we're just bored.

Well, it's like, it's not that broken, so they just let it continue. And also, me being cynical.

Sure. Okay, but I mean, if we say that the lang attribute is not reliable, do we still think that hreflang attributes are that reliable? If I'm marking up something as "this is the German version", and then it turns out to be the English version, we are kind of having the same problem, no?

I think not, and that's because the lang attributes typically come baked into whatever publishing software you are using. hreflang is not. So with hreflang, you have to put in some extra effort to deploy it. It probably costs a lot, so you actually pay attention. If you are deploying for a bigger company, let's say the big e-commerce site nile.com, and you have to add it to every page, either in the sitemap or in the HTML or HTTP headers, that's probably a very big investment monetarily. And then you want to
ensure that whatever you're doing is actually correct.

Okay.

There's an incentive for you to ensure that it's correct. With given things like the lang attribute that comes with your CMS, or, actually, the IDE might just give you the HTML template, which already has, like, lang equals...

So why isn't hreflang baked in, just like lang would be, from a CMS perspective? Why are they not doing...?

lang is part of the HTML standard, and I think to be a "valid", quote unquote, HTML document you have to have a language attribute set.

Ah, okay, and hreflang is...

So out of the box you have to have that, so therefore it already has it. But for hreflang, you don't need it necessarily. It's a thing that you can have, but you don't have to have it.

To be fair, you don't have to have lang either. The browser will work just fine. But people are still using validators that are like, "This doesn't have a lang attribute, this is not fine", and then they're like, "Okay, fine, I'll just set it to English."

Also, I think the number of sites that actually need hreflang is really tiny.

Yes.

Compared to the web.

Compared to the web, yeah.

I looked this up before this episode: the Web Almanac says that 9% of sites have hreflang, in 2022.

Of sites or pages?

Homepages. The crawl in 2022 only crawled homepages. But when I read that number, I also didn't know: what does this mean? Is it a lot? Is it not very many?

That sounds like a lot to me. That's way more than I expected. Way more. Like 1 or 2%.

[unclear] just said five.

If it were pages on the internet, then it would be fine, I think. But if it's just homepages, then it feels weird.

And 5% out of the nine, I don't know if I'm saying that right, but English was the bulk of what they set as the hreflang for the homepage.

Interesting. I wonder if that's a misunderstanding, and people were like, "Oh yeah, we have to have a version of hreflang that says this is the English version", even if we don't have any other language
versions. But then again, to be fair, I think there is a lot.

Okay.

I do believe there is quite a large portion of websites that do have multiple language versions.

Okay. 9%?

No, I think 9% is actually reasonable.

Really?

I think so. If you think about it, a lot of companies are doing a multi-language service, and even smaller restaurants are now offering their menus in German and English.

But you are in Switzerland.

Yeah, okay, fair enough, that's true. But even Germany, and...

But in the US you have Spanish.

Sure.

Oh, and then there are legal requirements, like Canada, all of the official websites at least.

But they are not homepages. They're not necessarily all [unclear].

Huh, interesting.

Anyway.

Yeah, anyway. I mean, I believe the stat, I'm just weirded out.

Yeah, it's a surprising number.

I think so as well. You said something about hreflang coming from HTTP headers, HTML tags, or the XML sitemap, right?

Yes. Is it a leading question?

No, no, no. What I'm wondering is: the HTML tags and the HTTP headers are relatively close to each other in terms of where they are being ingested, but the XML sitemap is slightly... Does it make a difference where I implement hreflang? Do I have to implement all of them, or can I get away with just doing one of them? And if I do one, which one?

So the ingestion, yes, it's coming from different parts of the pipeline. Sitemap annotations come from the feeds ingestion point.

Oh yeah, true, that's feeds.

And the headers, probably... no, the headers for sure come from Googlebot.

Sure.

I mean, everything comes from Googlebot, but, like, HTML, in the crawling part of...

Sure.

But internally, the representation of whatever we crawl will have the headers, and then it will have the body, the HTML or the HTTP body message, as well. And then from there we are doing processing with it. Technically, it doesn't matter where you provide it, but I think that it can be faster if it's in the HTTP headers or in the HTML, because feeds, like, we process
it eventually, at one point...

Hm.

...but it's not tied to a specific page.

Okay.

If you publish something, like you want to get that weird non-GMO scented candle online as fast as possible and sell it in multiple markets, then you probably want that annotation to be seen as fast as possible by Googlebot, because there is a dependency crawl triggering when we discover hreflang. We want to verify...

Mhm.

...that that's truly a language variation.

Yeah, okay.

So you probably want to have it in the HTML. But of course that's not always feasible, because you might have nile.com, which has billions and billions of pages, and then you want to put it somewhere that doesn't cause problems for your serving infrastructure, because hreflang blocks can be really big. If you support, I don't know, like our site supports what, 11 languages or something?

18.

Sure, 18.

But that's not that big of a block.

I mean, well, byte-wise, yes, if you add it up for the whole site, or per...

I mean, if you add it up, of course. Less than a kilobyte, no?

Yes, a kilobyte is not much, but if you are serving it a billion times...

Okay, but we cache, no?

Well, no, we don't. No, no, but think about the site's perspective: you have to serve that a billion times to users.

Oh, yes, yes, from a site's perspective. I was like, from our perspective it's a billion pages, it's like, whatever.

Yeah.

Are there any other reasons to have it on the page, not just for Googlebot finding it?

I mean, for search engines. I don't think anything else uses it.

Okay. What about for discovery?

No. I don't think it's good for discovery, other than discovering the dependency.

Dependency?

Yeah, so verifying that the two URLs are in the same cluster.

I see.

Something that surprised me in our docs, I think, is that the page has to self-reference in the set of hreflang blocks.

I don't remember why we have it there. Yesterday I was editing
that page, and I saw that sentence, and I was like, "I wonder why we have this here." It's not actually required. I know that we have it there for a reason. I just don't remember.

[unclear] Like, what is real? Because it seems counterintuitive. Why would you need to say that this is the page that I am on? Isn't it obvious that you're on the page?

I think it was somehow related to rel canonical.

Oh.

We were recommending that you put a self-referencing canonical on the page with the hreflang block.

Mhm.

And I think we moved away from that. Basically, it's like: if you put rel canonical on the page, then don't do it for hreflang. Do it because, I don't know, it will save puppies or whatever, but it's not related anymore to hreflang. And then we saw some issues, and I might be making this up, this is from memory, we saw some issues, and then we said, "Just put a self-referencing hreflang on the page."

Is this a fail-safe kind of thing, where it's like, if you say this, then okay, we have another mechanism to check that that's what you indeed meant?

Yeah, I think so. More signals, basically.

Okay.

But I'm not 100% sure why we say it. I know that there was, back then, a good reason.

It sounds like it would kind of still work if you don't do it, for some cases at least, but all right.

I'm pretty sure it would work.

Huh, that's interesting.

Try it.

We should try it. I haven't tried it yet. I will try it.

Test it.

But you said canonical, and that brings me to a different thing. You have two language versions, okay? And we both know that Swiss German and German German are, at least in written form, pretty similar, unless you write dialect, which you shouldn't.

Yeah.

But there are differences. For instance, if I have a product page for a German shop, you will see a different VAT, and you will see a different currency, and probably also a different price. And then the Swiss... like, the variations are pretty much the exact same page except for VAT and price and
currency.

Sure. Shipping?

Oh, maybe shipping, yeah.

Policy.

But maybe shipping only shows up later on. Maybe it's not on the product detail page, maybe it's in your cart or who knows where.

Why is shipping in your cart?

Like, if you go into the cart, and then it actually says how much shipping you have to... I don't know. My point is, these pages are very similar to each other, so I think deduplication might kick in. But then you add hreflang, like, "Oh, there are different versions." That works, right?

Yeah.

Because I've heard, and I think I know why, people complaining about that not working. I think that has something to do with the reporting in Google Search Console.

Oh yeah, yeah, definitely.

Okay. So that's that one.

Do we want to explain?

Yes. I don't know what you mean.

Oh, okay.

What is the problem in Search Console?

So what people have told me is: "Here, I go to example.de, and this page is not canonical, and Google says, or Search Console says, it's not indexed. But why is that? I have the hreflang set up and everything." And I'm like, well, it is indexed, as part of the hreflang... well...

Oh, interesting. It's not indexed, but there's a reference alternate?

Yeah.

Okay, it becomes an alternate.

Thank you! That's it. Yeah, now I have the right wording. You know when you know a concept but it's really hard to put into words? I always said, "Well, it's kind of indexed, but not really." So it's an alternate. So in the reporting it looks like it fell out of the index at some point.

Ah, okay.

And from a user perspective, or site owner perspective, they are both unique, they should both be indexed. But I think a searcher in Germany would still get the German version even though the Swiss version would be canonical, for instance.

I think it's also important to mention that Search Console only reports on canonicals. So basically, the vast majority of the hreflang clusters, when we are talking about similar languages, they are not canonical, so
basically you are blind to what's happening on the non-canonical pages of a cluster. But that was a tough product decision, because we don't store information about the alternate names. We just put them in the dup cluster and say that these are localized alternate names that we can show in search results when the user query plus settings deserve, not deserve, what's the word, imply it. For example, we could swap out the URL for nile.com to nile.ch if the language on it was similar and the user was from Switzerland.

I see. And is this from an efficiency perspective, like, resourcing-wise, storage in Search Console?

Yes. Search Console has some really big sites in it, and when you have to store information for those sites as well, like, I don't know, nile.com and the, um, what, [unclear], each of these having hundreds of millions of pages, then it becomes a real struggle figuring out how to store data efficiently, and then you will have to make shortcuts.

So are you saying that it looks like a problem in Search Console, but it's not actually a problem in practice?

But we also had an announcement about this, and I know that people were unhappy. We tried to explain it.

Hm.

It didn't necessarily change how people perceive it.

Right.

We consider it a feature. I know that some people consider it a bug.

I mean, I see their perspective, because canonicals sometimes change. So in Google Search Console it looks like pages drop out of the index and eventually come back in and drop out again, and you're like, "This should be a more or less straight line, but it's a squiggly graph right now. What is happening?" And then it turns out nothing is happening. It's just that the alternates switch and the canonical switches.

And nothing from the site owner's perspective changed. They're still saying that one is the main one, and Search Console is saying that they're switching.

Yeah. Nothing has changed, but the
report looks like something has.

And the other thing is, how do I test my hreflang setup, actually?

So many people have written articles about this.

I'm just asking questions here.

Yeah, because you can trust random articles on the internet. All the time. So, Google doesn't provide an hreflang validator. We never actually have. We had some reporting about it that was underused, and then we removed it.

Okay.

Oh, unfortunate for those who needed it.

Now we are relying more... like, if someone in public asks me this question that you asked, then I would send them to external tools that I know work well.

Mhm.

I remember that Aleyda Solis's tool works really well. If you don't know who Aleyda Solis is, then why are you listening to this podcast? The other one that I think we mention in our help center documentation is Bill Hunt's hreflang Builder.

Maybe. There's one called Merkle. Is this also Bill Hunt, or somebody else?

That's someone else.

Okay, the testing tool that we mention is Merkle.

Okay, so there is also Merkle. I don't know anything about that, somehow.

Interesting.

I mean, we put it there for a reason.

Yeah. The reason is it works.

But from Google, there's nothing to validate.

Okay, but there are tools that I can use to test, that just happen to be third-party tools. Okay, cool.

Oh, and Bill Hunt is from Back Azimuth. I think it's an SEO agency. And we put those tools there because we tested them extensively, and we know that they work.

Okay, that's good. Another thing that I wonder if it works: so there is this fallback mechanism, the x-default, right? You can specify x-default. So if I have a German and an English version, and my English version is my main content version, can I just specify that x-default is the English version and then I have a German version? Or should I have a German, an English, and an x-default, even though the x-default and the English both point to the same thing?

What? Okay, do you have to have x-default? What is it,
what is its purpose?

So with x-default, you annotate a fallback page. Like, "I don't have a language for this particular user, so show the user this page."

Oh, so if you have a country selector or something, you can send them over there, or some catch-all page.

Sure, but you could also just set one page that you think you can show to users to help them navigate the site. It doesn't necessarily have to be the country selector, and it might be different from what is the canonical. So you might want to set both. Very likely you want to set both, because the canonical is probably the page that they arrive on.

Mhm, okay.

And then the x-default doesn't have to be that.

No?

Oh, I didn't know that. Interesting. I thought the x-default is kind of like, quote unquote, the language version that you just happen to want for everyone who is not having a dedicated language. But it could be a completely separate page. It doesn't necessarily need to be the alternate version of that same page you're on. It could be some other page. It could be a page literally saying, "Sorry, we don't support your locale" or something.

But it actually can be another language page?

Yes. Like, I don't know, you want to confuse them, and then you serve the page in Basque, for example, and then there will be only four people who will understand.

What kind of site are you managing?

I don't want to say.

This is getting spicier than I anticipated. But let's again say I have a website that has two language versions, the English version and the German version of the content, but I consider the English version to be the catch-all fallback locale for anyone. Can I then just specify the English version as the x-default, and the German version? Or do I have to specify the German version explicitly, the English version as the English version, and then the English version again as the x-
default?

I think it's the latter. So basically, you set up a correct hreflang cluster where you specify the different language versions, and then you add another hreflang link with the x-default set to...

Okay, the English. But I can't just leave out the English one because it's already there?

Well, you cannot...

Uh-huh.

...unless, I think, you are on that particular page already. So this goes back to a question that you had previously: do you have to have a self-reference?

Mhm.

And, okay, now I'm also going back to that question: I think it's easier to set up hreflang clusters if you have a self-reference as well, because then you can just take that cluster and copy-paste it over everywhere, because it's going to be the same.

True, that is true. Right, from an implementer's perspective, like, how I'm going to manage it.

And if you want to leave out the current page, I think it works. I'm pretty sure it works.

You will be testing that later. Martin will be testing the limits here.

And then you can put in an x-default that is pointing to wherever you want.

Okay.

But I think we describe it as something that helps users decide what to do with your site if you don't support their language or something.

Okay, so from a usability perspective it's helpful.

Yeah, except that...

Nice.

So I'm like, I don't know, you wrote a blog post about underused reasons, like, reasons why you should do x-default.

I think we had a reason for that blog post, but I don't remember what it was. Okay, I know: because it can help with the canonicalization.

Aha. How so?

Pages that are in an hreflang cluster can become canonical much, well, not much, easier.

Mhm.

I think that was the message of the blog post, and that's on top of rel canonical. And we actually mention this in the canonicalization documentation on [unclear], on developers.
google.com/search.

There's got to be an easier acronym than [unclear] search. I don't know, if we have to say it full length every time we cite it, is it really a useful acronym?

I don't know, I can say it quite fast: developers.google.com/search, developers.google.com/search.

What will you summon if you say that three times in front of a mirror?

But not in front of a mirror.

Maybe it's the end of the podcast.

I think it is the end of the podcast. Thank you so much for answering all my questions regarding multinational or multi-language sites and hreflang. And I learned something today.

Wow. That was not my intention, sorry.

I only learned how quickly you can say developers.google.com/search, developers.google.com, and how many times without a verbal typo.

Yeah, verbal stumbling.

All right, okay, I think that's it for this episode.

developers.google.com/search. Oh, developers.google.com/search.

It's fine, we know that you can say it. So next time, Search Off the Record will be talking about what's wrong with crawling nowadays, I hear, so I'm looking forward to that. Thanks to everyone listening out there, and thanks to both of you for being here with me and talking about hreflang and stuff. And goodbye. Ciao.

May the snails be with you.

We've been having fun with these podcast episodes, and we hope that you, the listener, have found them both entertaining and insightful too. Feel free to drop us a note on Twitter @googlesearchc, or chat with us at one of the next upcoming events that we go to, if you have any thoughts. And of course, don't forget to like and subscribe. Thank you so much, and goodbye.

[Music]
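
Editor's sketch (not part of the episode): the two-language setup discussed near the end, with an English and a German version, a self-reference on every page, and an x-default pointing at the English page, could be generated roughly as follows. This is a minimal Python sketch under stated assumptions: the example.com URLs and the helper names (`hreflang_entries`, `html_link_tags`, `http_link_header`) are hypothetical and do not come from the speakers or from any Google tool. It renders the same cluster once as HTML `<link>` elements and once as an HTTP `Link` header value, two of the three places mentioned in the episode (the third being the XML sitemap).

```python
from html import escape

# Hypothetical two-language cluster; these URLs are placeholders.
CLUSTER = {
    "en": "https://example.com/en/page",
    "de": "https://example.com/de/page",
}
# Assumption: the English page doubles as the fallback for everyone else.
X_DEFAULT = CLUSTER["en"]


def hreflang_entries(cluster, x_default=None):
    """Return (hreflang value, URL) pairs for one cluster in a stable order."""
    entries = sorted(cluster.items())
    if x_default is not None:
        entries.append(("x-default", x_default))
    return entries


def html_link_tags(cluster, x_default=None):
    """Render the cluster as <link> elements for the HTML <head>."""
    return [
        f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}" />'
        for lang, url in hreflang_entries(cluster, x_default)
    ]


def http_link_header(cluster, x_default=None):
    """Render the same cluster as the value of an HTTP Link response header."""
    return ", ".join(
        f'<{url}>; rel="alternate"; hreflang="{lang}"'
        for lang, url in hreflang_entries(cluster, x_default)
    )


if __name__ == "__main__":
    for tag in html_link_tags(CLUSTER, X_DEFAULT):
        print(tag)
    print("Link: " + http_link_header(CLUSTER, X_DEFAULT))
```

Because every language version, including the page being served, appears in the output, the same block can be reused unchanged on each page of the cluster, which is the copy-paste convenience mentioned in the conversation.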