Rendering JavaScript for Google Search | Search Off the Record

2024-07-11 · en automatic
hello and welcome to another episode of
Search Off the Record a podcast coming to
you from the Google search team
discussing all things search and having
some fun along the way my name is Martin
and I'm joined today by John from the
search relations team of which I'm also
part of hi John hi Martin and we are
joined today by Zoe Clifford from the
rendering team hi Zoe howdy Hey Zoe
would you like to introduce yourself
yeah I'm Zoe Clifford you may remember
me from getting up on stage with Martin
at Google I/O around 2019 or so I yeah
work for Google bike to work work on
rendering I like dogs and cats fun times
that's it for me which one is better
dogs or cats well you're you're going to
make me choose between dogs and cats on
a podcast John okay fine is it depends
the answer uh so I have a favorite but
I'll never admit which
one it would make the other two sad
that's totally just like
Google okay so you're in the rendering
team and I'm not sure everyone
understands what rendering is about but
we have the web you make a website you
use HTML and CSS right am I missing
something you are missing something
Martin it's a scary word that starts
with j gifs gifs yes yes there can also
be gifs on web pages as well as
JavaScript JavaScript no it's not GIF
it's okay it's JavaScript all right okay
it's technically Guavascript
no no it's JavaScript is Guavascript
actually useful do we need that
for something yeah there there's many
web pages out there that I'm quite fond
of where if you try and load them
without JavaScript you'll just get a short
string of text that says please enable
JavaScript to access this web page fair
so I know that there's a lot of websites
especially when they use the wonderful
term client side rendering that actually
fetch their content using JavaScript and
uh I guess we want to see the content to
actually be able to index it no uh yeah
it is generally useful to have the
contents in the DOM to be able to index
it oh now we're using another fancy word
the DOM the document object model so
what's that what even is it all all I
can tell you Martin is it's kind of like
HTML but unwrapped into a tree form
which reflects the browser's view of the
page at render time yeah it's like the
browser's mental model of a website yeah
but I I've never actually read the DOM
spec so there could be something else
about it that I've never heard of I'm
not sure about that either now you make
me question my my worldview that's
that's that's something that's
interesting okay so we're using the DOM
which is like the representation of all
the content inside the browser and that
can be changed and controlled by
JavaScript is that roughly accurate yeah
yeah that's right right and for that to
be able to see things that have been
manipulated added or removed by
JavaScript we have to render right right
right you can also have a DOM without
any JavaScript at all fair that's true
even static websites have a DOM yeah but
then what is this rendering what happens
inside Google search when we render a
page okay so render is uh a very
overloaded term but in this context it
means headless browsing headless being a
particularly gory industry term for a
browser which is controlled by a
computer and the reason we run a browser
in the indexing pipeline is so we can
index the view of the web page as a user
would see it after it has loaded and
JavaScript has executed okay interesting
so I guess that involving a browser and
having to kind of like run Pages through
a browser is is pretty challenging no oh
yeah it's very expensive it's so
expensive the exact amount of
expensiveness is highly confidential ah
oh but then if it's so expensive how do
we decide which page should get rendered
and which one doesn't oh we just render
all of them as long as they're HTML and
not other content types like PDFs what
but that that's expensive yeah yeah it
is expensive but then if it's so
expensive then then why can is it is it
okay but we are rendering all the pages
that are HTML Pages all of them get
rendered right right right and it is
expensive but that expense is required
to get at the contents for the most part
Pages which do not require JavaScript to
index are cheap to render anyway so we
don't think about it we just render all
of them Ah that's really interesting
fantastic and uh and I guess we have
introduced I remember in 2019 when we
were on the stage at I/O we've
introduced like the evergreen Googlebot
so we are getting browser updates
pretty regularly no that's correct uh we
follow stable Chrome or stable chromium
technically but that wasn't always the
case why has that not been the case
before 2019 that's a good
question because before this effort to
follow stable Chrome there was a lot of
uh manual integration work to like take
this normal browser core like Blink and
turn it into um a headless browser
capable of running in the Google
indexing pipeline uh and we kind of
slacked a bit on browser updates and
eventually the API we were using the
blink platform API uh was deprecated and
removed so we had to switch to something
else and it's like I'm tired of all
these manual updates we're just
switching to chromium so basically
before that we we had to install all the
updates manually and now googlebot gets
the updates fresh more or less yeah yeah
uh we we were very careful to make sure
we had this continuous integration I'm
going to put that on my resume by the
way continuous integration of uh
Upstream chromium really really fancy
that's really really nice in this biz
you got to use words like continuous
integration on your resume you can't
just say I'm really good at installing
updates you got to say CI/CD I still have
to do these things manually I should get
a John update that installs Chrome
updates automatically you manually
update your Chrome I thought that kind
of does like happen in the background
automatically no well is like constantly
just well I mean constantly like every
now and then this thing that it's like
oh you have to update your browser and
it's like oh gosh I have to spend 15
seconds restarting my browser so
annoying but you get all the cool new
browser features and you can build more
interesting and amazing websites with it
and as far as I understand that mostly
then works with Google search uh mostly
mostly so all all the systems that we've
taken care to extract will for for sure
keep working if there's like some new
attribute or something we might not like
look at it automatically but it won't
like break anything for sure oh cuz we
have tests to make sure that stuff
doesn't break oh it was a terrible time
Martin before we had all those tests things
would just break and no one could stop
them I mean I remember being a web
developer back before 2019 when uh there
was the big shift to ES6 I think that
was in 2015 and we got so many new
features in JavaScript and we could use
none of them because Google search
wouldn't support them yeah at the time
we were running an older version of
Blink with an older version of V8 so we
had a lot of trouble with ES6 and it it
was a big problem which was one of the
motivations for switching to continuous
integration When you mention all these
low-level browser parts like Blink which
is the rendering engine in Chrome and
then V8 was this JavaScript execution engine
or rendering engine then uh there must
have been scary things that you ran into
uhuh yeah have I told you the ghost
story of iterator
iterator there was one day when we were
updating our Blink
version and as part of this we had to you
know do some QA another thing to put on
my resume to make sure that the new
version actually worked for all the
websites out there so you looked at all
the pages on the web uh not all the
pages we'd like divy up a bunch of pages
with the most diffs and everyone would
like get 10,000 pages each to kind of
glance over it was a lot of fun you know
I just spent hours and hours and hours
just looking at web page diffs it was
great but one of these diffs was like
actually a really subtle difference
there was just something on some Wiki
article
uh not Wikipedia one of the other wikis
about um some TV series and part of the
page just looks suddenly wrong to me so
I open up the console log and I see a
curious error message iterator Act is
not defined that is probably not defined
that that sounds like es
6.5 yeah so I thought maybe this is some
kind of weird JavaScript keyword with a
bizarre name so I used a search engine
to search for it and there were zero
results what and I tried again with all
the other search engines I could think
of and there were still zero results so
then you made a page and now you rank I
searched in the page and the page didn't
reference it anywhere and I searched in
the browser source code and it it wasn't
referenced anywhere there either whoa it
was a ghost in the machine a Ghost in
the Shell where did it come from in the
end it came from V8 V8 okay yeah uh so
the code has changed since then but at
the time V8 came with some bundled
JavaScript files which as part of
compiling the browser these JavaScript
files would get pre-processed and shoved
into C arrays C arrays being kind of
the C++ equivalent of data URLs but as
part of this pre-processing there was a
macro substitution step where it would
substitute one string for another string
and this macro
substitution uh tried to substitute two
strings at once only there was some
overlap so if they were substituted in
the wrong order this was a nondeterministic
order because of Python dictionary uh
ordering then it would produce this bad
output of iterator from iterator and
object oh I couldn't tell you the exact
details now but it was something like
that if you search for my name in the
Chromium commit log and it it's quite hard
to find now but it's somewhere in there
oh wow so your browser was hallucinating
before hallucinating was cool yeah yeah
uh so so that was some gnarly stuff
there and that that was my first
contribution to the chromium code base
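
The ordering hazard in that story can be reproduced in a few lines. The following is a minimal sketch in Python, not the actual V8 build script: the macro names `ITER` and `ITERATOR`, their replacements, and the helpers `expand` and `expand_safely` are all hypothetical, chosen only to show how two overlapping substitutions give different output depending on the order in which they run.

```python
def expand(source: str, macros: list[tuple[str, str]]) -> str:
    """Apply each (name, replacement) pair in order using plain string replacement."""
    for name, replacement in macros:
        source = source.replace(name, replacement)
    return source


# Two hypothetical macros whose names overlap: "ITER" is a prefix of "ITERATOR".
LONG_FIRST = [("ITERATOR", "iterator_symbol"), ("ITER", "iter")]
SHORT_FIRST = [("ITER", "iter"), ("ITERATOR", "iterator_symbol")]

template = "var x = ITERATOR;"

# Longer name handled first: the intended output.
good = expand(template, LONG_FIRST)
# Shorter name handled first: it rewrites part of the longer name,
# which then never matches, leaving a mangled identifier behind.
bad = expand(template, SHORT_FIRST)


def expand_safely(source: str, macros: list[tuple[str, str]]) -> str:
    """Sort the longest name first so the result no longer depends on input order."""
    ordered = sorted(macros, key=lambda pair: len(pair[0]), reverse=True)
    return expand(source, ordered)


print(good)  # var x = iterator_symbol;
print(bad)   # var x = iterATOR;
```

Sorting the longest name first is one way to make the result independent of iteration order; the conversation does not say how the real fix worked.
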
cool so one of the questions I I
sometimes hear from people is whether it
makes sense to implement uh structured
data using JavaScript and the worry is
sometimes is like it's too fragile or
like Google hates JavaScript it's like
of course they don't tell Martin that
but they tell me that sometimes what do
you think is implementing structured
data with JavaScript is is that a
problem does it work well how do you see
that we're very good at executing
JavaScript and I think javascript's
great uh we mentioned a lot of problems
with like ES6 but now that we're
following like the normal Chromium release
schedule uh we basically get new
JavaScript keywords for free and for the
most part don't throw weird exceptions
that won't also be thrown on the web
that said it is possible for stuff to go
wrong in particularly complicated
scenarios uh for example if a web page
is loading hundreds and hundreds of
resources and it is possible that we
won't always be able to fetch all the
resources due to like crawl rate or HTTP
errors or stuff like that so
javascript's great but I'd also take
some care to make sure that the web page
isn't too fragile if errors do happen
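
One way to read the advice about fragility: a failed secondary request should cost the page only the part that depended on it. Here is a minimal sketch, in Python for brevity rather than the JavaScript a real page would use. `FetchError`, `render_page`, `rate_limited`, and `comments_ok` are hypothetical names, and the failure is simulated rather than fetched over the network.

```python
from typing import Callable


class FetchError(Exception):
    """Raised by the hypothetical fetcher when a request fails (for example HTTP 429)."""

    def __init__(self, status: int):
        super().__init__(f"request failed with HTTP {status}")
        self.status = status


def render_page(main_content: str, fetch_extra: Callable[[], str]) -> str:
    """Build the page body; a failed secondary fetch must not discard the main content."""
    parts = [main_content]
    try:
        parts.append(fetch_extra())
    except FetchError as err:
        # Degrade gracefully: keep what we have and state clearly what is missing.
        parts.append(f"[comments unavailable: HTTP {err.status}]")
    return "\n".join(parts)


def rate_limited() -> str:
    """Simulated secondary request that is rejected with 429 Too Many Requests."""
    raise FetchError(429)


def comments_ok() -> str:
    """Simulated secondary request that succeeds."""
    return "3 comments"


degraded_page = render_page("Article text", rate_limited)
full_page = render_page("Article text", comments_ok)
```

In both cases the article text survives; only the dependent section changes, and the failure is visible as a clear message instead of a blank page or a redirect.
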
Okay so how do you mean fragile if
errors happen uh like if you have a web
page which accesses uh an API endpoint
and that API endpoint could return a
429 under certain circumstances then
this is one example of where things
could go wrong if the return call there
is critical and the page fails to have
good contents without a successful response
from it okay and then what what happens
do it does a page just stop loading or
does everything get deleted it depends
on the web page uh I've seen like
partial page contents blank pages Pages
which redirect to google.com um error
messages if there's going to be like an
error and you can't load the content I
think it's best to have a clear error
message but ideally it's best to have
the contents of course okay and to so so
I guess on the one hand the error
Handler is is something that should be
kind of reasonable and not crash the
rest of the pages loading but yeah yeah
uh like if there's an uncaught exception
because a video fails to load I've seen
a case where a video fails to load so
the page redirects to google.com
actually wow um that's a popular
redirect destination uh and this was a
case where the page had good contents
but then this tiny little thing went
wrong so it's like I'm going to throw
this all away so if there is an
error I just try and handle it as
gracefully as possible and this is hard
stuff don't get me wrong web development
is hard stuff I'm not a web developer it
like terrifies me I guess testing it is
hard if it sometimes breaks but if it
always breaks what would you recommend
like how how could someone test it to
see if it's like generally possible that
it could work there's this uh webmaster
tool the Search Console URL Inspection tool
that's great stuff if that works then
generally it's possible that Googlebot
could also render it yes generally and
rendering in Google is as close to a
normal browser as possible Right but
it's not quite the same is it yeah do do
you want to hear another ghost story
Martin oh please please do tell
it's not quite the same and one of the
ways it's different is we try and do
things as efficiently as possible so
efficiently that there's this certain
JavaScript event that we were not firing
called requestIdleCallback because
our browser was never idle oh this is all
well and good but there was a certain
popular video website which I won't name
to protect the
guilty which um deferred loading any of
the page contents until after
requestIdleCallback was fired this is
actually a very reasonable thing to do
you might want to you know get the video
playing first and then load all the
comments and stuff for example but since
our browser was never actually idle this
event was never fired so we couldn't
load most of the page contents which was
a problem for this website oh so now we
fake being idle every once in a while
just so pages get better that that's
one of the weird things that can happen
when you have a browser that's mostly
but not entirely like a normal browser
so it has to be like Oh I'm I'm so bored
and actually it's busy all the time what
kind of things have have you noticed
that people otherwise get wrong when it
comes to rendering another common class
of issues is called user agent
Shenanigans Shenanigans being a
technical industry term that's what we
call in the bit what are US Asian
Shenanigans Enlighten us so imagine you
write a website and you're like I really
really want Google in particular to be
able to Index this web page so you're
like okay I'll put in an if statement if
user agent header equals googlebot
output go down this code path and output
this HTML which I think will be really
good for googlebot for some reason and
this is all well and good it's tested it
works but then years pass by the website
changes may maybe it gets updated to a
different framework or whatnot and
there's just this code still lurking
deep within it somewhere and it starts
outputting HTML which is like uh broken
or useless or missing contents or stuff
like that and this is what I would call
user agent Shenanigans we used to call
that dynamic rendering and we're actually
discouraging it now if that makes you a
little happy
ah so there is an industry term for it
besides Shenanigans I think I ran across
a case of this recently now now that you
mention it like this uh so in in one of
the help Forum threads
someone uh was was mentioning that their
their homepage title was wrong and I
looked into it and it seemed that we
were being redirected to a page that
does a 404
uh but if you look at it in a browser it
redirects to a page that's normal and uh
in in the end I I noticed you could
reproduce it by telling Chrome to use
Google bot's user agent oh yeah I love
that feature probably that that is
happening in the background where
someone is like oh I will be smart and
do something special for
googlebot and then the next person who
works on the website is like I don't
know I don't see anything wrong it works
works for me yeah I I love the dev tools
user agent override feature it's great
for debugging stuff like this sometimes
I'll even be trying to debug a web page
and I change my user agent to Google bot
and then it's like your access to this
web page has been denied because you're
doing you're using a suspicious user
agent and I'm like no I wanted to debug
this Shenanigan's gone wrong that's
where they're being good and checking
that the Google bot user agent comes
from an official IP address as
recommended in the documentation but it
it still makes it harder for me to debug
so I cry a single tear okay that's uh
understandable understandable I would
say how do you feel about JavaScript
redirects so redirects are kind of a
topic in the SEO world where everyone
has very strong
opinions and JavaScript redirects kind
of feels like that things like it's like
even normal server-side redirects are
this weird SEO myth topic and JavaScript
redirects are like oh my gosh what do we
even do with them what do we even do
with them well we follow them so so they
work just like normal redirects or for
the most part JavaScript redirects of
course have to happen at render time
instead of crawl time but that's the
pretty much the only thing special about
them I don't think we like treat them
differently in any way there have been
cases where a web page gets into a
JavaScript redirect
Loop uh which is not very fun but okay
yeah well I guess that happens with
normal server side redirects from time
to time as well where they're like oh
you don't have a cookie it's like here's
a cookie and then it checks again it's
like oh you didn't take my cookie take
another one and just keeps going forever
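
The cookie loop described here is a redirect chain that revisits a URL it has already seen. The following is a minimal sketch of detecting that, assuming a hypothetical in-memory `redirects` map instead of real HTTP responses. The `max_hops` limit of 10 is an arbitrary illustration, not a documented Googlebot limit.

```python
def follow_redirects(
    start: str, redirects: dict[str, str], max_hops: int = 10
) -> tuple[str, bool]:
    """Follow a redirect map and return (last_url, stuck).

    `stuck` is True when the chain revisits a URL or exceeds `max_hops`.
    """
    seen = {start}
    url = start
    for _ in range(max_hops):
        target = redirects.get(url)
        if target is None:
            return url, False  # no further redirect: this is the final page
        if target in seen:
            return target, True  # we have been here before: a loop
        seen.add(target)
        url = target
    return url, True  # hop budget exhausted without reaching a final page


# A page that keeps redirecting to itself because its cookie check never passes.
cookie_loop = {"/check": "/check"}
# An ordinary single redirect.
simple = {"/old": "/new"}

loop_result = follow_redirects("/check", cookie_loop)
simple_result = follow_redirects("/old", simple)
```

Because each render starts from a fresh session, a check that depends on a cookie surviving between visits behaves like `cookie_loop` above.
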
our cookies do work pretty good though
we have good cookies we have fairly good
cookies yeah and in rendering do we also
accept cookies or how how does that work
do we accept cookies cookies are enabled
if there's a cookie dialogue that says
do you want to accept or deny these
cookies we won't click either button
we're Rogue like that we just don't make
a
decision but uh on the browser level
cookies are enabled so if a web page you
know sets a cookie without going through
a dialogue then we'll see it okay but we
don't keep that for the next time right
uh no no rendering is stateless
every time it happens it's a completely
fresh browser session basically very
very nice so if we're in the territory
of like we're not clicking on cookie
banners and and it's stateless I think
when we fetch things we're using Google
bot for that right so we do follow
robots.txt yeah yeah of course we follow
robots.txt that's the whole point of
robots.txt but browsers don't uh yes but
we're we're a search engine Martin okay
fair enough yeah
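
For illustration, here is a minimal sketch of a robots.txt check using Python's standard-library `urllib.robotparser`. The rules and the `example.com` URLs are made up, and the standard-library matcher is a simplified one, not Google's parser, so treat this only as a picture of the idea that a disallowed path is not fetched while the page itself still is.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt that blocks everything under /api/ for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The HTML page itself may be fetched...
page_allowed = parser.can_fetch("Googlebot", "https://example.com/article")
# ...but a resource the page loads from the blocked path may not.
api_allowed = parser.can_fetch("Googlebot", "https://example.com/api/comments")

print(page_allowed)  # True
print(api_allowed)   # False
```

If the blocked path only serves something minor, the rendered page is still complete; if it serves the main content, the rendered page has nothing to show.
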
yeah that makes sense that makes sense
okay fine fine but that means that if
your API is roboted or disallowed for
Google bot then rendering can't fetch
API content right uh that's correct so
we'll get the crawl which is like the
HTML and that could be roboted but if
it's not roboted and it's HTML it's sent
to rendering and then rendering loads
this in a browser which of course can
make HTTP fetches to a bunch of other
stuff and any of those other resources
could also be roboted if a resource is
roboted we just can't fetch it we
continue on with rendering the rest uh
so if there's an API call like you
said and we can't fetch the API call
then maybe that's okay if it wasn't
doing anything important but if it was
like fetching the page contents then we
have a problem and I guess that's that's
hard for us on on Google side to
recognize because we don't know what the
page is supposed to look like yeah I
mean it is very reasonable for someone
to just be like I don't want Google
seeing my content I'm just going to
block this API call fair enough I'm
totally okay with that but if it looks
like a broken page it's uh can't be
indexed the best way cool well this was
super fun thanks for joining us Zoe oh
yeah it's always a lovely time to hang
out with my good pals John and Martin a
thank you Zoe it's always good to talk
to you and and rendering is such a
fascinating topic and the WRS the web
rendering service such an amazing piece
of software yeah the the last time I had
a talk with Martin we were up on stage
at Google I/O and that is a blank spot in
my memory I remember nothing of it I
just remember getting up on stage and
walking off of stage and that's it
having a great time hopefully this was a
great time as well and maybe you'll
remember this one as well oh I hope so
we'll send you a recording to remember
John this has been Search Off the Record
there's no record oh off the record of
course yeah thank you so much Zoe for
being here thank you John for joining me
as well and um everyone out there thank
you so much for being with us uh and I
hope that this episode was interesting
and fun and useful may your page indexes
be contentful goodbye everybody goodbye
bye we've been having fun with these
podcast episodes and we hope that you
The Listener have found them both
entertaining and insightful too feel
free to drop us a note on Twitter at
@googlesearchc or chat with us at one
of the next upcoming events that we go
to and of course don't forget to like
and subscribe