Google Search Reliability | Search Off the Record

2024-10-03 · en automatic
hello and welcome to another episode of
search off the record a podcast coming
to you from the Google search team
discussing all things search and having
some fun along the way my name is
sometimes Gary and I'm from the search
team I'm joined today by two guests uh
Ben Walton and David Yule from the Google
search
let's see if I can pronounce it site
reliability engineering team hi
both hi Gary would you like to introduce
yourselves a little bit perhaps going
with Ben first because that's
alphabetical yeah sure uh so my name is
Ben I'm the lead for the search platform
SRE teams and if you don't know who we are
that's a good thing because that means
we've kept search up and running for you
we're responsible for all the core
components in the stack hi and I'm David
Yule I'm also in the search platforms team
um and I've been in SRE for about nine
years now mean yeah all the time in
search wow long timers so we sometimes
work together when uh things go south
with search and then I sometimes pop up
in your inboxes or in your chats and
basically annoying you with questions
about is search healthy or things like
that in general I have an idea about
what SRE is doing what search SRE is doing
let me tell you what I think you are
doing and then you tell me whether I'm
wrong or not does that sound good to you
yeah yeah cool so basically what I think
is that you are both the Gandalfs of
search and you are using white magic to
basically keep search up and running by
employing your minions and Magic is that
accurate at all uh I think that's a bit
of a stretch
oh whenever anyone says magic to me it
probably frightens site reliability
Engineers because we lean on
understanding how things work and when
when there's a bit of the system that's
magic you can't handle anything that
that's bad because it's like okay can
you define it for me like what is your
magic yeah um I I'll have a go and then
I'll leave it to Ben to follow up the
main focus is we're software Engineers
just like the folk developing features
in search but our Focus day-to-day is
just working out how we can make web
search that bit more reliable that bit
safer and so there's a lot of work on
thinking about how to stop things going
wrong right ideally the job is you don't
have any issues at all it's all work
smoothly because of the work you do but
of course you the visible bit internally
is often when there are issues we're
we're the first people usually on the
line to try and work out what's going
wrong and and having a very clear set of
playbooks and tools that we can use
to mitigate the problem right yeah and
so for me a sort of a core aspect of how
we perform our role well is we we have
to understand at a low level how things
work how they fit together so that we
can see how when we make changes they
won't work anymore and sort of de-risk
that uh early so we do try to be very
proactive and forward-looking engaging
as many of the large changes as we can
but obviously we can't be everywhere
we're a very small group of people
relative to the number of people
shipping code at Google yeah so things
do sometimes break and and then we're
there to Unbreak them but basically what
you are trying to do is to keep Google
search and I guess all its features up
24/7 right sort of there's a saying in
SRE that aiming for 100% reliability is
you're never going to do it you're
always going to have issues oh so one of
the things we look at is how reliable do
we need the service to be oh wow and
that varies from product to product so
for search obviously it's a very
high-profile service and so we keep
ourselves to a very high standard right
but that means we need to do more work
sometimes it can slow down development
because we have to be cautious rolling
things out so we have to choose exactly
what level of reliability we are aiming
for so so for example for Gmail or
Google Maps they might have a different
uh what we call SLO uh service level
objective than let's say Google Search
right yeah exactly and and you can think
about this you know from a business
perspective an additional nine can cost
an awful lot of money right uh you know
in software engineering and human time
system time resources to make that go so
we need to make smart trade-offs based
on what our users need and yeah and
what's best for Google that is
absolutely frightening to me how do you
get into this line of work like what
motivates you to become an SRE I I
didn't ever imagine myself here I've
just always been a tinkerer you know
Computer Science Background I love
programming but I love understanding how
things work and how to improve them and
make changes uh safely and thinking
about scale and and I I love debugging
other people's engineering and for me
there were two things that got me into
SRE I think one of them is I'm a bit of a
graph nerd so looking at monitoring
graphs and seeing you know things is
going in the right direction is
something I spend too much time doing
but the other thing was when I joined
Google I didn't join in site reliability
group I was an engineer in a different
area and the very first bit of code that
non-trivial bit of code that I submitted
to web search almost caused an outage caused
an outage oh high five and so I had a
Frontline view of oh my oh my whatever
um I've broken the Google sort of thing
and I had a Frontline seat to seeing how
coolly and calmly the people sort of
handled it and how you know they had
tools to yeah to make sure it wasn't bad
and you know from then I was sort of
watching on and thinking yeah that's
that's a cool team to work for and yeah
I was I was really happy when I managed
to to move into working it yeah I I can
weirdly relate to that uh back in 2012
is I was working as a s for uh our
indexing systems and uh what we call
externally Caffeine and uh I submitted a
change that uh broke news indexing and
it was a Friday change just like it's
written in the big book and I got a call
I was in Austria uh for a private trip
like a weekend trip and I got a call
from back then one of the VPs that uh
that was not very nice and they had to
roll back my change and how to say that
it was incredibly stressful which is my
next point that it like for for you
folks it must be incredibly stressful to
keep all the systems up so we can answer
billions of queries each day or do you
get used to it or is that not even
the case or I'll share a few thoughts
and then maybe David can correct me or
or augment we'll see which which way he
goes but part of me thinks if you're not
a little bit terrified of being on call
for for Google search that you know
you're probably not you're too numb at
that point you need to be on your
toes and stay sharp because it's
changing all the time uh and you can't
rest on your laurels uh but but there is
is a you know a level of acceptance that
you get to with that uh where you know
you understand it can be stressful but
you know that you have a team around you
right you know that you can always you
know you're the captain of the ship when
there's something going wrong you you
can be the most Junior person on the
team and you can get directors to to go
get resources for you and and help fix
problems and right it's very very
powerful that way so you do come to it
accept that but I think if you're you're
not feeling that a little bit uh still
over time then then you know
that's a worrying signal for me yeah I
agree with Ben I mean but there is the
point that he mentioned being on call so
the majority of your time you're working
on normal project stuff so it's you know
what what are you going to achieve this
week not what might happen in the next
minute yeah so when when you're on call
and it's yeah maybe one sixth of your time or
something like that um then there is a
little bit of stress there um but
outside of that there isn't and the
thing that makes it so much easier for
me is knowing you know if there is a big
issue people always appear people always
help out volunteer to help and so yeah I
mean it is it does feel like a team
sport yeah when the big issues arise
yeah I see that on the search SRE chat
that um um as soon as some some bigger
thing is happening then two three people
immediately show up and they are there
just to back the person who's the on-call
person takes away some of the stress as
well because you can rely on someone
else's knowledge about systems and
how to debug stuff it's it's actually
one of one of the the neatest aspects I
think of of working at the scale is that
nothing uh no human can fit all the
required knowledge in their head so you
do have to depend on your team very
heavily and and it has developed a great
culture of everyone is willing to help
all the time yeah which is pretty neat but
what happens if you let's say press the
wrong button and you I don't know erase
a data center for example like what
happens with your managers for example
like will you get fired or you get I
don't know a pay cut or or or something
you might actually get a pay bonus and
actually get paid more um we we've had
examples of this that in general the way
we try and think about it anyway is if
it's possible to you know type the wrong
command and bring down a major service
then then there's something wrong with
the system and the processes we have in
place so if you managed to do it you
found a problem in our system which we
can then fix and and genuinely we've had
a few cases where somebody has done
something which has been the trigger for
you know sometimes a major incident and
because they handled it well they got
everyone you know sort of complimenting
them for how they handled what happened
next it is it can be a stressful time
and yeah knowing that somebody makes
a genuine mistake if they make a mistake
and it causes a problem then that's
something we can fix yeah that's pretty
awesome but I I'm trying to imagine what
you are doing every day because for for
me again like you're not going to change
my mind you are just basically Gandalf
who who knows everything and can fix
everything how does your day look like
from an actual workday perspective like
are you sitting in front of the computer
and waiting for an incident to happen or
you're doing I don't know writing
scripts or I I'll answer in two two ways
when I'm not on call so not the person
who's going to get the the first alert
when something happens I'm just doing
Project work so it's you know working on
that design doc making code changes and
and rolling them out so it's it's very
much very similar different type of work
but the same cadence as for a
normal software developer when I'm on
call I
try and do the same thing with an
acceptance that I might get interrupted
and my day is gone because you know
something major has happened but yeah
you're never just fiddling your your
your fingers and waiting for a big
explosion when when we have new people
join the team I kind of try to set their
expectations you know you've got your
project time and you should isolate that
when you're not on call when you're not
handling interrupts and and the stuff
you know on on the front end of the the
pager you know focus on your project
work and get that done when you're on
call you know if the pager is quiet there
are other interrupts there are small you
know personal things that you could
drive and and Advance um but you know
really try to separate those two buckets
of time so that you're not disappointed
if the pager goes off uh that it's
interrupted your project work uh you
know that is your time go go find that
weird graph and and get nerd sniped into
digging into the next big problem
that'll save us millions of of failed
queries for users or something like that
uh and you do see that happen uh people
will use that time and they they'll turn
up the next interesting thing and now
we've got an awesome project for people
to work on for for some period of time
yeah it's you know it's not always that
that way we do plan projects they don't
just spawn themselves from graphs all
the time but you know H having that
mindset that you've got interrupt time
and you've got project time is is a
useful separation but is the
firefighting part still the core of the
work or it's more towards Dev work I I
think we skew more towards Project work
than than interrupt and firefighting you
know there there are periods in time uh
where you you feel like you're doing
more firefighting than you would like uh
but I I you know maybe 30% of our time
is is that David yeah that that feels
right about right I mean there is the
point that when you are getting those
interrupts and you're you're potentially
getting alerted it's it is harder to
switch back to project work so even when
you're on call you don't if if I look back
at a day as well there was only about an
hour when I was actually responding to
an incident but the rest of the day went
because you had to context switch a few
times yeah but it definitely skews much
more towards Project work okay we we
mentioned on call a few times already
can we Define what what does it mean to
be on call because probably most people
are not on call in general yeah so we
have one person who is the Prime uh
responder for one part of the search system
so they're the people who if our
monitoring notices a problem they will
get an alert you know making their phone
beep and and the expectation is for us
that you will respond to that within a
couple of minutes right so so you have
to be you know ready and at your desk
that sort of thing but with the
understanding that this is stressful so
you do this for maybe three four days
and then you hand over to somebody else
and and they're they're the on caller
and then of course we do that with two
sites so we can do the 24-hour shift so
for us there's a site in Dublin and a
site in California you mentioned you
that uh phone beeping that's what we
would have called a few years ago pager
we we've got multiple uh so we're we're
SREs and we always have one more
way than we actually need to to page
ourselves most people have uh we've got
an app for that we've got you know
paging and and you'll get a tele like a
text message and you'll get a telephone
call a telephone call what yeah wow is
it is it annoying um or it's supposed to
be annoying right well it's supposed to
get your attention I I you know for the
most part the the app is what gets my
attention quickly enough and I I don't
ever see the text or get the phone call
because I've already acknowledged the
page but all right and the Annoying bit
actually was thing that I learned I used
to have the same ringtone for when my my
phone goes off as when I get an alert
and so I suddenly found I was getting
stressed when my wife called me or
something like that so so change it so
you got a different tone for an alert
than your wife you you were basically
conditioned Pavlovian
response and then when you are let's say
that the pager goes off and you are in
firefighting mode that's basically
running scripts and watching dashboards
with beautiful graphs and writing shell
commands I I don't I I can't actually
imagine like what you are doing I I
think that like the first thing you want
to do is kind of get a gut check on what
is the actual impact of this um is is it
big is it small do I need more help
immediately um and and so you know the
first few minutes on initial triage to
figure out is this a real thing or is
this a you know false alarm um that that
kind of thing that's my first sort of
minute or two David I don't know what
you approach it as yeah so there's
trying to work out what's happening and
why you've been alerted is is the first
thing I think and and really
understanding that and then yeah and
then you do move on hopefully fairly
quickly to how can I mitigate it how can
I stop the bleeding is the is the phrase
we often use um that maybe a few years
ago was you know shell script based but
nowadays we we've tried to get it to a
few fairly standard mitigations which
most the time will work so you know we
we have tooling so it's a lot easier to
do we notice this change has just
gone out can we roll that back so we're
in the state that we were 10 minutes ago
and there's a button for that and and
make it a little bit easier to use so
you don't have to write a a shell script
while you're all stressed and maybe get
that wrong yeah that makes sense no
people do do still do that but I think
it's it's typically after we've got the
mitigation in place when we're trying to
expedite changes or uh you know
scripting and querying is is still used
but definitely not on the front line and
it certainly has increased over the
years I'm trying to imagine in my head
an incident like something goes wrong I
imagine that there are
several uh levels to an incident because
like from my experience working with you
folks uh like some incidents go under
the radar even internally but like it
wouldn't show up in my inbox but then
some incidents they are extremely
visible in
externally like for example like
something happens with news indexing um
or fresh indexing am I right that there
are multiple levels also internally to
these incidents or you are just like
really good at hiding them well well so
ideally even the biggest ones are still
hard for most people to spot um but but
yeah we do try to classify based on user
impact or or Revenue impact or you know
different severity uh Dimensions uh so
again that can impact how you might
respond to something if something is
negligible impact you know you can take
your time debug it a little bit more
deeply if it's a huge impact you know
you've got to you know mitigation
mitigation mitigation you've got to
figure out how to stop that bleeding and
perhaps a very ignorant or even stupid
question but how do we know that there's
an incident like is it like on social
media we get lots of reports and then
one of the sres uh spots that or do we
have tools that go off or how how does
that look like so in general we aim to
make sure that we're the first people
who notice it with all due respect Gary
when you pop up and say people are
complaining that that there's a problem
with search we think oh no we've got it
wrong if if if you if you pop up you
know 30 minutes into our debugging and
we are yes we know we're working on it
then then that that's kind of a success
for right for the monitoring side anyway
um so yeah so we really focus on and and
we look at stats for you know how many
of the incidents small to large did we
notice first or how many did a user
report to us and and if if a user reported
it then we often have okay so there's a
gap in our monitoring how can we fix
that so the next time something similar
happens we notice it and we don't need
to wait on users can it be as easy as uh
now I'm trying to think back to the pre-Google
period of the common Gary um and
I was managing servers uh for a hosting
company and one thing that I was looking
at obsessively is the error rate the
HTTP error rate in the front ends or the
servers that are serving front ends can
it be that simple also on on Google's
scale or it's way more nuanced uh so yes
yes and I would say oh so we we
definitely still care about HTTP error
rates and things like that but one of
the really cool things in search in in
my time here has been sort of the
evolution of thinking where yes we still
have that that foundational level of
care for for what you're getting as a
response but we're actually thinking a
lot more nuanced and fine-grained are you
getting the right product experience
right now oh yeah and that really
requires sort of understanding not just
are we shipping something to you and
giving you a 200 uh response but is what
we ship to you correct and working
correctly oh wow and and that you know
we've really pushed the envelope on that
over the last I'm going to say five years to
me that's just mind-blowing um what what
do you think about going through an
actual incident that happened and uh see
how how it appeared on on your end would
that work for you sounds like fun sure
do you have an incident by any chance on
your mind because if I pick then that
would be painful one of my favorites I
think I think an Sr is allowed to have a
favorite incident one of my favorites
was uh during the uh football soccer
World Cup in 2022 so year and a half ago
we had issues where when there were
some of the matches on we we got
alerted
and it was kind of one of these failures
which was a success failure to a certain
extent or we suddenly got way more
traffic than we were expecting yeah my
mental model before this was if there's
a match on you watch the TV watch the
match turns out people also search
especially when there's a goal they
search who scored what's the information
about the scorer and so we were seeing
these massive spikes of traffic whenever
anyone scored that's the one that sticks
out for me
it it it certainly uh sticks in a lot of
people's memories that was uh I think
maybe one of our best uses of of our
IMAG training our Incident Management at
Google H you know we put all those best
practices into play when that happened
and a lot of people contributed to
making that go okay that sounds cool um
then let's talk about that because it's
the World Cup uh I imagine that we do
some extra provisioning for those times
like when when we know that there's
something big happening then we add more
resources I imagine or more machines or
something like that yeah so I mean it
goes back to what we were talking about
at the beginning that if we got this
right then we'd have done all the work
six months in advance and predicted this
is how much traffic we're going to get
this is how how expensive to serve this
traffic is and make sure that we had
planned it well in advance and so we had
the capacity to serve it when you say
expensive are
you thinking of dollar expensive or
resource like machine resource expensive
I I generally think of well CPU
expensive so how how how expensive it is
for a machine to handle is because right
not all requests are the same a
simple query which we've had exactly the
same one of maybe we'll be able to serve
it out of a cache and it'll be super
cheap from a Computing perspective but
yeah if we get it wrong then it gets
more and more expensive that was one of
the issues we faced in in this incident
was surprisingly CPU intensive to
serve most of these queries oh but I
imagine that you also do load testing
before releasing these special
features for stuff like the World Cup
that might also reveal stuff that
otherwise might go unnoticed I imagine
yeah there there's there was an awful
lot of planning that that went into this
and a lot of you know both projection on
on sort of the expected usage load
testing and cost profiling uh but it
turns out that cost profiling
before the real event is not as easy as
we would like it to be yeah do we also
increase Staffing like do we get more
sres for that time in a in a room and
they we lock them in the room and now
you just watch these graphs until the
World Cup is
over uh no so so for Staffing it
was we've done the pre-work it should
just work so we will have one person who
will get paged if there's a problem and
you know we all we all rally around and
when it became clear that there was a
larger problem so yeah so in general we
try to say as long as we've done the
pre-work then you know we don't need to
have people just staring at graphs all
the time and then how did we notice that
there's something going wrong this was a
great case of our our automated alerting
caught it and and gave us early warning and
and and in particular the thing that it
warned us about was errors but it wasn't
errors that were particularly
obvious to users so we we had so much
traffic that we were basically at our
limits for what we could serve but that
meant that we we have processes in place
where we will drop the lower priority
traffic so for example you know somebody
internal in Google is running some sort
of load test which loads our systems
that's the first thing that will just
get dropped on the floor if we have an
issue and then there's you know there's
some other lower Priority Services where
if it if it's a little bit flaky if it
fails
a couple of percent of the time then no
one will really notice and those those
were the next ones that go but then it's
at that level that we got alerted that
thing things are getting bad and could
get worse if if we don't deal with it
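
Editor's sketch: the priority-based dropping described here can be illustrated roughly as follows. This is a minimal sketch and not Google's implementation. The traffic classes, the `Request` type, the abstract `cost` units, and the alert threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical traffic classes; lower values are shed first."""
    INTERNAL_LOAD_TEST = 0
    BEST_EFFORT = 1
    USER_FACING = 2


@dataclass(frozen=True)
class Request:
    name: str
    priority: Priority
    cost: int  # abstract CPU units, an assumption for illustration


def admit(requests, capacity):
    """Serve highest-priority requests first and drop any that no longer fit.

    Returns a (served, dropped) pair of lists.
    """
    served, dropped = [], []
    remaining = capacity
    # sorted() is stable, so arrival order is kept within one priority class.
    for req in sorted(requests, key=lambda r: r.priority, reverse=True):
        if req.cost <= remaining:
            served.append(req)
            remaining -= req.cost
        else:
            dropped.append(req)
    return served, dropped


def should_alert(dropped, threshold=Priority.BEST_EFFORT):
    """Alert once shedding reaches the threshold class or anything above it."""
    return any(req.priority >= threshold for req in dropped)
```

With this shape, dropping only internal load tests stays silent, while dropping best-effort traffic triggers the alert before user-facing queries are affected, which mirrors the order of events described in the conversation.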
meaning that the Integrity of search is
at risk for example like would you say
that like for example a a feature like
the World Cup onebox could that
affect search as a whole uh sure yeah I
mean I mean so if it had got much much
worse then we would have been serving
you know people would have searched the
score and they just got a an error
saying yeah sorry our Engineers are on
it um it didn't get to that level but
yeah I mean that's that's that's the
thing that you worry about when you get
these these alerts that sounds insanely
like literally I got stressed just
listening to you uh so what did we do to
to fix the issue because it is a very
meta issue in in my head because it's
like queries are becoming increasingly
expensive and in my brain that just
means that we throw more machines or
more CPU or more RAM in the pot and then
let it be expensive but apparently
that's not the case well that's part
partly the case you know where we're
able to upscale things we would but
we would also look to reduce the costs
we would look to change how we're
managing the traffic uh many you know it
is a significant challenge there as you
say
so it was no one single solution in that
case yeah and and the thing that made it
sort of a lot of work was we do try and
have systems which will you know throw
more machines at the problem when we
start to notice we're full but this was
such an extreme Spike that you hit a
limit at some point yeah and one of the
things that we noticed is we saw this
about halfway through the tournament and
so we saw that we were hitting these
massive spikes of traffic and struggling
to serve them we were pretty confident
when the World Cup final comes around
that's going to be a bigger match
there's going to be bigger spikes and so
we had a a deadline of I think it was
about two weeks before we knew the
biggest game was was happening and so we
we had some time but not a lot of time
to sort of put in place a few you know
longer term mitigations to to make sure
that we could we could serve things
smoothly and then if you think about
search search is not a monolithic service
like basically it's not like just one
service running but probably hundreds if
not thousands of services running
together and then those Services being
orchestrated to serve users queries when
you say that you add resources you have
to find the actual service that is
starving right I'm trying to imagine how
would I go about like trying to find
which service is starving like where to
add resources and in in my head that
just seems impossible because we have so
many smaller Services running I imagine
you have graphs for that and yeah so so
in this case it was it was fairly
straightforward the alerting gave us
very direct signal as to where to look
for issues and and things like that
there there have over time been more
esoteric issues but they tend not to be
at the scale that is as significant as
what we were seeing there during the
World Cup okay let's see how else would
I fix issues Google has lots of data
centers I know from the SRE chat that
sometimes data centers are taken offline
for service or whatever maybe I could
add back one of those data centers to
help alleviate stress from other data
centers is that a possibility yeah
there's there were definitely sort of
things that we did around moving
resources around and making sure we were
using all the resources that we had
available and yeah I mean it was
actually kind of nice to see as as as
Ben said it was fairly obvious which
system which part of the system was
under the most stress so throw resources
at that and some of the systems were
actually totally fine and so we could
you know steal resources from one to
give the other oh yeah yeah strangely
enough the the the subcomponent which
just does Sports was humming along pretty
much fine because you know it it knew it
needed to to serve basic information
about this is the score and it had
caching set up so it could serve a huge
amount of traffic for that so it wasn't
that one it was one of the the the other
large components and it's often that way
it's the one you focus on you get right
beforehand and it's it's the one next to
it that causes problems I think all this
chat that we had just reinforces me that
I don't really want to be
an SRE it it's it's a fantastically
interesting role though like I'm going
to put a little pitch in here I I I
agree with that every day is different
you're solving puzzles well
unfortunately my day-to-day work is also
very um diverse um we we we
mentioned uh a few mitigation how is
that different from a fix from internal
perspective so the way I see it is a
mitigation is you know something very
shortterm to make sure that we are in a
vaguely healthy state but it's not a
long-term fix so you can do things like
as I mentioned you can roll back to the
state of the system say half an hour ago
but you can't just leave it there um you
you have to work understand what the
actual problem was and do the underlying
fix before things start rolling
forward again right you do the
mitigations and then one of the big
things from an incident that we do and
and we did with this one was you then
write a postmortem afterwards of this is
what happened in detail these are all
the things that went really well these
are all the things that went really
badly and these are all the things that
we can fix so next time there's a big
sports event it happens without any SRE
knowing or caring concretely uh recently
someone in my team got paged uh they saw
that this was an issue affecting a
single data center the response was take
that data center offline it stopped
serving users stopped noticing any
potential impact uh so that's the
mitigation the the fix is when we
identified the root cause of of why that
data center wasn't working and and
restored that uh to fully functioning
order and could put it back in service
right because users are routed to a
different data center if that route is
broken to whichever data center was
taken offline right yeah back to the
World Cup thingy um was this actually
noticed externally like did we get
complaints from actual users or we
didn't get any obvious complaints
because I mean we caught it there were
some errors but you would have to be
watching the HTTP requests going back
and forth to actually notice them um and
then yeah I mean the work we did meant
that during the World Cup final it was
actually nice and quiet and things ran
smoothly so it's kind of one of one of
the reasons I like it is because it it yeah
had a happy ending I
guess Sundar tweeted I think we set traffic
records during that event oh yeah um I I
actually have the Tweet a screenshot of
the Tweet here and uh Sundar said uh
search recorded its highest ever traffic
in 25 years during the final of the FIFA
World Cup some background questions um
let's say that I'm fresh out of school
and I decided that I want to become
specifically search SRE how do I go
about it do you have any tips or tricks
to become search SRE do you want to
take a stab at that David yeah so the
first thing to say is focus on the
engineering side because it is an
engineering role in terms of what you
need to know there's not that much
difference between developer and SRE but
then the thing I would focus on top of
that is you know are you the sort of
person who likes troubleshooting
something's broken and understanding why
it's broken and how to fix it and the
advantage is computers always break so
there's plenty of there's always plenty
of uh use cases that you can find to
what what what's going on here and
really drilling down and if you actually
enjoy that type of work then yeah SRE
might be a good role for you so so try a
few of those to get a feel that that
would be my my view Ben yeah so like
plus one uh engineering mindset uh
debugging tinkering playing with things
and systems is always is going to to get
you on the right path there I think um I
I would note you know we have a very
diverse group of people that work in in
SRE uh backgrounds uh where they're from
um my my mentor when I started was a
political science major um and and sort
of learned to be an engineer because he
liked to Tinker and play and and things
like that so you don't need a
traditional right Computer Science
Background either um it's it's you know
yes capabilities and and and knowledge
uh but uh mindset will get you a long
way too how deep of a knowledge do you
have to have like do do you have to be
able to notice that some random bit was
flipped by cosmic ray or that's that's
that's way too low level for for your
roles we do debug right down into you
know hardware issues and CPUs wow um at
at times not not everyone is able to to
go that deep but you know kernel issues
uh network issues wow hardware issues we
we do debug down to that level okay and
now I'm even more scared of you yeah but
but the but the flip side is you know
one of the things that I think SRE has
is we we often have a more breadth to
what we look at so as a search SRE you
you end up looking at quite a few bits
of the web search stack if you're a
developer then you get a little bit more
depth and you get to be the expert on on
one part of system so so there is the
breadth type of thing which means you have
to accept that you're not going to be
the expert all the time and you will
hand off to the to the kernel expert
who understands this and so you don't
need to be an expert in all these things
to be an SRE actually probably the soft
skills around you know communication and
collaboration are way more important for
an SRE than yeah right what are your
Linux skills uh um yeah final abilities
so yeah plus one awesome thank you very
much for joining me here today uh for
this chat it was frightening and um also
very eye-opening well yeah and thank
you from me as well it's it's it's nice
to to get some publicity in a good way
yeah like likewise really enjoyed this
Gary thank
you we've been having fun with these
podcast episodes I hope you The Listener
have found them both entertaining and
insightful too feel free to drop us a
note on Twitter at @googlesearchc or
chat with us at one of the next events
we go to if you have any thoughts and of
course don't forget to like And
subscribe thank you and goodbye