Rendering JavaScript for Google Search | Search Off the Record
2024-07-11 · en automatic
[Music]

Hello and welcome to another episode of Search Off the Record, a podcast coming to you from the Google Search team, discussing all things Search and having some fun along the way. My name is Martin, and I'm joined today by John from the Search Relations team, of which I'm also a part. Hi, John.

Hi, Martin.

And we are joined today by Zoe Clifford from the rendering team. Hi, Zoe.

Howdy.

Hey, Zoe, would you like to introduce yourself?

Yeah, I'm Zoe Clifford. You may remember me from getting up on stage with Martin at Google I/O around 2019 or so. I work for Google, bike to work, work on rendering. I like dogs and cats. Fun times. That's it for me.

Which one is better, dogs or cats?

Well, you're going to make me choose between dogs and cats on a podcast, John? Okay, fine. "It depends" is the answer. So I have a favorite, but I'll never admit which one. It would make the other two sad.

That's totally just like Google. Okay, so you're on the rendering team, and I'm not sure everyone understands what rendering is about. But we have the web: you make a website, you use HTML and CSS, right? Am I missing something?

You are missing something, Martin. It's a scary word that starts with J.

GIFs?

Yes, there can also be GIFs on web pages, as well as JavaScript.

JavaScript. No, it's not GIF. It's okay, it's JavaScript.

All right, okay. It's technically "Guavascript".

No, no, it's JavaScript. Is Guavascript actually useful? Do we need that for something?

Yeah, there are many web pages out there that I'm quite fond of where, if you try to load them without JavaScript, you'll just get a short string of text that says "Please enable JavaScript to access this web page."

Fair. So I know that there are a lot of websites, especially when they use the wonderful term "client-side rendering", that actually fetch their content using JavaScript. And I guess we want to see the content to actually be able to index it, no?

Yeah, it is generally useful to have the contents in the DOM to be able to index it.

Oh, now we're using another
fancy word: the DOM, the Document Object Model. So what's that? What even is it?

All I can tell you, Martin, is it's kind of like HTML, but unwrapped into a tree form which reflects the browser's view of the page at render time.

Yeah, it's like the browser's mental model of a website.

Yeah, but I've never actually read the DOM spec, so there could be something else about it that I've never heard of.

I'm not sure about that either. Now you make me question my worldview. That's interesting. Okay, so we're using the DOM, which is like the representation of all the content inside the browser, and that can be changed and controlled by JavaScript. Is that roughly accurate?

Yeah, that's right.

Right. And to be able to see things that have been manipulated, added, or removed by JavaScript, we have to render, right?

Right. You can also have a DOM without any JavaScript at all.

Fair, that's true. Even static websites have a DOM. But then what is this rendering? What happens inside Google Search when we render a page?

Okay, so "render" is a very overloaded term, but in this context it means headless browsing, "headless" being a particularly gory industry term for a browser which is controlled by a computer. The reason we run a browser in the indexing pipeline is so we can index the view of the web page as a user would see it after it has loaded and JavaScript has executed.

Okay, interesting. So I guess involving a browser and having to run pages through a browser is pretty challenging, no?

Oh yeah, it's very expensive. It's so expensive, the exact amount of expensiveness is highly confidential.

Oh. But then if it's so expensive, how do we decide which page should get rendered and which one doesn't?

Oh, we just render all of them, as long as they're HTML and not other content types like PDFs.

What? But that's expensive.

Yeah, it is expensive.

But then if it's so expensive, then why... is it okay? But we are rendering all the
pages that are HTML pages? All of them get rendered?

Right, and it is expensive, but that expense is required to get at the contents. For the most part, pages which do not require JavaScript to index are cheap to render anyway, so we don't think about it. We just render all of them.

Ah, that's really interesting. Fantastic. And I remember in 2019, when we were on stage at I/O, we introduced the evergreen Googlebot, so we are getting browser updates pretty regularly, no?

That's correct. We follow stable Chrome, or stable Chromium technically, but that wasn't always the case.

Why was that not the case before 2019?

That's a good question. Before this effort to follow stable Chrome, there was a lot of manual integration work to take this normal browser core, like Blink, and turn it into a headless browser capable of running in the Google indexing pipeline. We kind of slacked a bit on browser updates, and eventually the API we were using, the Blink platform API, was deprecated and removed. So we had to switch to something else, and it's like, "I'm tired of all these manual updates. We're just switching to Chromium."

So basically, before that we had to install all the updates manually, and now Googlebot gets the updates fresh, more or less?

Yeah. We were very careful to make sure we had this continuous integration (I'm going to put that on my resume, by the way), continuous integration of upstream Chromium.

Really fancy. That's really nice.

In this biz you've got to use words like "continuous integration" on your resume. You can't just say "I'm really good at installing updates." You've got to say CI/CD.

I still have to do these things manually. I should get a John update that installs Chrome updates automatically.

You manually update your Chrome? I thought that kind of happens in the background automatically, no?

Well, it's like, constantly... well, I mean, "constantly". Like, every now and then, this thing
that's like, "Oh, you have to update your browser," and it's like, "Oh gosh, I have to spend 15 seconds restarting my browser." So annoying.

But you get all the cool new browser features, and you can build more interesting and amazing websites with them. And as far as I understand, that mostly then works with Google Search?

Mostly, mostly. So all the systems that we've taken care to extract will for sure keep working. If there's some new attribute or something, we might not look at it automatically, but it won't break anything for sure.

Oh, because we have tests to make sure that stuff doesn't break.

Oh, it was a terrible time, Martin, before we had all those tests. Things would just break and no one could stop them.

I mean, I remember being a web developer back before 2019, when there was the big shift to ES6, I think that was in 2015, and we got so many new features in JavaScript, and we could use none of them because Google Search wouldn't support them.

Yeah, at the time we were running an older version of Blink with an older version of V8, so we had a lot of trouble with ES6. It was a big problem, which was one of the motivations for switching to continuous integration.

When you mention all these low-level browser parts, like Blink, which is the rendering engine in Chrome, and then V8, which is the JavaScript execution engine, there must have been scary things that you ran into.

Uh-huh, yeah. Have I told you the ghost story of "iterator iterator"? There was one day when we were updating our Blink version, and as part of this we had to, you know, do some QA (another thing to put on my resume) to make sure that the new version actually worked for all the websites out there.

So you looked at all the pages on the web?

Not all the pages. We'd divvy up a bunch of pages with the most diffs, and everyone would get 10,000 pages each to kind of glance over. It was a lot of fun. You know, I just spent hours and hours and hours just looking at web page diffs. It was great. But one of
these diffs was actually a really subtle difference. There was just something on some wiki article (not Wikipedia, one of the other wikis) about some TV series, and part of the page just looked suddenly wrong to me. So I open up the console log and I see a curious error message: "iterator Act is not defined."

That is probably not defined. That sounds like ES 6.5.

Yeah, so I thought maybe this is some kind of weird JavaScript keyword with a bizarre name. So I used a search engine to search for it, and there were zero results.

What?

And I tried again with all the other search engines I could think of, and there were still zero results.

So then you made a page, and now you rank.

I searched in the page, and the page didn't reference it anywhere. And I searched in the browser source code, and it wasn't referenced anywhere there either.

Whoa. It was a ghost in the machine. A Ghost in the Shell. Where did it come from?

In the end, it came from V8.

V8, okay.

Yeah. So the code has changed since then, but at the time V8 came with some bundled JavaScript files. As part of compiling the browser, these JavaScript files would get pre-processed and shoved into C arrays, C arrays being kind of the C++ equivalent of data URLs. But as part of this pre-processing there was a macro substitution step, where it would substitute one string for another string. This macro substitution tried to substitute two strings at once, only there was some overlap. So if they were substituted in the wrong order (this was a nondeterministic order because of Python dictionary ordering), then it would produce this bad output of "iterator" from "iterator" and "object". I couldn't tell you the exact details now, but it was something like that. If you search for my name in the Chromium commit log, it's quite hard to find now, but it's somewhere in there.

Oh wow. So your browser was hallucinating before hallucinating was cool.

Yeah. So that was some gnarly stuff there, and that was my first contribution to the
Chromium code base.

Cool. So one of the questions I sometimes hear from people is whether it makes sense to implement structured data using JavaScript. The worry is sometimes, like, it's too fragile, or, like, Google hates JavaScript. Of course they don't tell Martin that, but they tell me that sometimes. What do you think? Is implementing structured data with JavaScript a problem? Does it work well? How do you see that?

We're very good at executing JavaScript, and I think JavaScript's great. We mentioned a lot of problems with ES6, but now that we're following the normal Chromium release schedule, we basically get new JavaScript keywords for free and, for the most part, don't throw weird exceptions that wouldn't also be thrown on the web. That said, it is possible for stuff to go wrong in particularly complicated scenarios. For example, if a web page is loading hundreds and hundreds of resources, it is possible that we won't always be able to fetch all the resources due to crawl rate or HTTP errors or stuff like that. So JavaScript's great, but I'd also take some care to make sure that the web page isn't too fragile if errors do happen.

Okay, so how do you mean, fragile if errors happen?

Like, if you have a web page which accesses an API endpoint, and that API endpoint could return a 429 under certain circumstances, then this is one example of where things could go wrong, if the return call there is critical and the page fails to have good contents without a successful response from it.

Okay, and then what happens? Does a page just stop loading, or does everything get deleted?

It depends on the web page. I've seen partial page contents, blank pages, pages which redirect to google.com, error messages. If there's going to be an error and you can't load the content, I think it's best to have a clear error message. But ideally it's best to have the contents, of course.

Okay. So I guess on the one hand the error handler is
something that should be kind of reasonable and not crash the rest of the page's loading, but...

Yeah. Like if there's an uncaught exception because a video fails to load. I've seen a case where a video fails to load, so the page redirects to google.com, actually.

Wow.

That's a popular redirect destination. And this was a case where the page had good contents, but then this tiny little thing went wrong, so it's like, "I'm going to throw this all away." So if there is an error, I'd just try to handle it as gracefully as possible. And this is hard stuff, don't get me wrong. Web development is hard stuff. I'm not a web developer. It, like, terrifies me.

I guess testing it is hard if it sometimes breaks. But if it always breaks, what would you recommend? How could someone test it to see if it's generally possible that it could work?

There's this webmaster tool, the Search Console URL Inspection tool. That's great stuff.

If that works, then generally it's possible that Googlebot could also render it?

Yes, generally.

And rendering in Google is as close to a normal browser as possible, right? But it's not quite the same, is it?

Yeah. Do you want to hear another ghost story, Martin?

Oh please, please do tell.

It's not quite the same, and one of the ways it's different is we try to do things as efficiently as possible. So efficiently that there's this certain JavaScript event that we were not firing, called requestIdleCallback, because our browser was never idle. This is all well and good, but there was a certain popular video website, which I won't name to protect the guilty, which deferred loading any of the page contents until after requestIdleCallback was fired. This is actually a very reasonable thing to do. You might want to, you know, get the video playing first and then load all the comments and stuff, for example. But since our browser was never actually idle, this event was never fired, so we couldn't load most of the page contents, which was a problem for this website.

Oh.

So now we
fake being idle every once in a while, just so pages get better. That's one of the weird things that can happen when you have a browser that's mostly, but not entirely, like a normal browser.

So it has to be like, "Oh, I'm so bored," and actually it's busy all the time. What kind of things have you noticed that people otherwise get wrong when it comes to rendering?

Another common class of issues is called user agent shenanigans, "shenanigans" being a technical industry term. That's what we call it in the biz.

What are user agent shenanigans? Enlighten us.

So imagine you write a website, and you're like, "I really, really want Google in particular to be able to index this web page." So you're like, "Okay, I'll put in an if statement: if the User-Agent header equals Googlebot, go down this code path and output this HTML, which I think will be really good for Googlebot for some reason." And this is all well and good. It's tested, it works. But then years pass by, the website changes, maybe it gets updated to a different framework or whatnot, and there's just this code still lurking deep within it somewhere, and it starts outputting HTML which is broken or useless or missing contents or stuff like that. And this is what I would call user agent shenanigans.

We used to call that dynamic rendering, and we're actually discouraging it now, if that makes you a little happy.

Ah, so there is an industry term for it besides shenanigans.

I think I ran across a case of this recently, now that you mention it. In one of the help forum threads, someone was mentioning that their homepage title was wrong. I looked into it, and it seemed that we were being redirected to a page that does a 404, but if you look at it in a browser, it redirects to a page that's normal. In the end, I noticed you could reproduce it by telling Chrome to use Googlebot's user agent.

Oh yeah, I love that feature.

Probably that is happening in the background, where someone is like, oh, I will
be smart and do something special for Googlebot, and then the next person who works on the website is like, "I don't know, I don't see anything wrong. It works for me."

Yeah, I love the DevTools user agent override feature. It's great for debugging stuff like this. Sometimes I'll even be trying to debug a web page, and I change my user agent to Googlebot, and then it's like, "Your access to this web page has been denied because you're using a suspicious user agent." And I'm like, "No, I wanted to debug this." Shenanigans gone wrong.

That's where they're being good and checking that the Googlebot user agent comes from an official IP address, as recommended in the documentation.

But it still makes it harder for me to debug, so I cry a single tear.

Okay, that's understandable, I would say. How do you feel about JavaScript redirects? Redirects are kind of a topic in the SEO world where everyone has very strong opinions, and JavaScript redirects kind of feel like one of those things. Even normal server-side redirects are this weird SEO myth topic, and JavaScript redirects are like, "Oh my gosh, what do we even do with them?"

What do we even do with them? Well, we follow them. So they work just like normal redirects, for the most part. JavaScript redirects of course have to happen at render time instead of crawl time, but that's pretty much the only thing special about them. I don't think we treat them differently in any way. There have been cases where a web page gets into a JavaScript redirect loop, which is not very fun, but okay.

Yeah, well, I guess that happens with normal server-side redirects from time to time as well, where they're like, "Oh, you don't have a cookie. Here's a cookie." And then it checks again, and it's like, "Oh, you didn't take my cookie. Take another one," and it just keeps going forever.

Our cookies do work pretty well, though. We have good cookies.

We have fairly good cookies, yeah. And in rendering, do we also accept cookies, or how
does that work?

Do we accept cookies? Cookies are enabled. If there's a cookie dialog that says, "Do you want to accept or deny these cookies?", we won't click either button. We're rogue like that. We just don't make a decision. But on the browser level cookies are enabled, so if a web page sets a cookie without going through a dialog, then we'll see it.

Okay, but we don't keep that for the next time, right?

No, no. Rendering is stateless. Every time it happens, it's a completely fresh browser session, basically.

Very nice. So if we're in the territory of "we're not clicking on cookie banners" and "it's stateless": I think when we fetch things we're using Googlebot for that, right? So we do follow robots.txt?

Yeah, of course we follow robots.txt. That's the whole point of robots.txt.

But browsers don't.

Yes, but we're a search engine, Martin.

Okay, fair enough. Yeah, that makes sense. Okay, fine. But that means that if your API is roboted, or disallowed for Googlebot, then rendering can't fetch API content, right?

That's correct. So we'll get the crawl, which is like the HTML, and that could be roboted. But if it's not roboted and it's HTML, it's sent to rendering, and then rendering loads this in a browser, which of course can make HTTP fetches to a bunch of other stuff, and any of those other resources could also be roboted. If a resource is roboted, we just can't fetch it. We continue on with rendering the rest. So if there's an API call, as you said, and we can't fetch the API call, then maybe that's okay if it wasn't doing anything important. But if it was fetching the page contents, then we have a problem.

And I guess that's hard for us on Google's side to recognize, because we don't know what the page is supposed to look like.

Yeah. I mean, it is very reasonable for someone to just be like, "I don't want Google seeing my content. I'm just going to block this API call." Fair enough, I'm totally okay with that. But if it looks like a broken page,
it can't be indexed the best way.

Cool. Well, this was super fun. Thanks for joining us, Zoe.

Oh yeah, it's always a lovely time to hang out with my good pals John and Martin.

Aw, thank you, Zoe. It's always good to talk to you. And rendering is such a fascinating topic, and the WRS, the Web Rendering Service, is such an amazing piece of software.

Yeah. The last time I had a talk with Martin, we were up on stage at Google I/O, and that is a blank spot in my memory. I remember nothing of it. I just remember getting up on stage and walking off stage, and that's it.

Having a great time. Hopefully this was a great time as well, and maybe you'll remember this one as well.

Oh, I hope so.

We'll send you a recording to remember.

John, this has been Search Off the Record. There's no record.

Oh, off the record, of course.

Yeah. Thank you so much, Zoe, for being here. Thank you, John, for joining me as well. And everyone out there, thank you so much for being with us. I hope that this episode was interesting and fun and useful.

May your page indexes be contentful.

Goodbye, everybody.

Goodbye.

Bye.

We've been having fun with these podcast episodes, and we hope that you, the listener, have found them both entertaining and insightful too. Feel free to drop us a note on Twitter at @googlesearchc, or chat with us at one of the next upcoming events that we go to. And of course, don't forget to like and subscribe.

[Music]
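
The advice in the episode about a critical API call returning a 429 (show a clear error message, and don't throw away good contents because one small thing failed) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `ApiResult` shape, the function names, and the comments section are all hypothetical and do not come from the episode or from any real site.

```typescript
// Hypothetical shape of an API result; an assumption for this sketch only.
interface ApiResult {
  status: number; // HTTP status code, e.g. 200 or 429
  body?: string;  // response payload, expected only on success
}

// Escape text before placing it inside HTML markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Render an optional comments section. A failed call produces a clear
// error message instead of throwing, blanking the page, or redirecting.
function renderComments(result: ApiResult): string {
  const ok = result.status >= 200 && result.status < 300;
  if (ok && result.body !== undefined) {
    return `<section id="comments">${escapeHtml(result.body)}</section>`;
  }
  return (
    `<section id="comments"><p>Comments could not be loaded ` +
    `(HTTP ${result.status}).</p></section>`
  );
}

// Assemble the page. The main content never depends on the comments call,
// so a 429 on that call cannot remove the content that should be indexed.
function renderPage(mainContent: string, comments: ApiResult): string {
  return `<main>${escapeHtml(mainContent)}</main>${renderComments(comments)}`;
}

// Usage: the comments endpoint is throttled, but the main content survives.
const throttled = renderPage("Episode notes", { status: 429 });
console.log(throttled);
```

The design choice here is simply to keep the failure local: the error is reported inside the section that failed, and the rest of the page renders as usual.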