Google Search Reliability | Search Off the Record
2024-10-03 · en automatic
[Music]

Hello, and welcome to another episode of Search Off the Record, a podcast coming to you from the Google Search team, discussing all things Search and having some fun along the way. My name is sometimes Gary, and I'm from the Search team. I'm joined today by two guests, Ben Walton and David Yule, from the Google Search (let's see if I can pronounce it) Site Reliability Engineering team. Hi, both.

Hi, Gary.

Would you like to introduce yourselves a little bit, perhaps going with Ben first, because that's alphabetical?

Yeah, sure. My name is Ben. I'm the lead for the Search Platform SRE teams, and if you don't know who we are, that's a good thing, because it means we've kept Search up and running for you. We're responsible for all the core components in the stack.

Hi, and I'm David Yule. I'm also on the Search Platforms team, and I've been in SRE for about nine years now. I mean, yeah, all the time in Search.

Wow, long-timers. So we sometimes work together when things go south with Search, and then I sometimes pop up in your inboxes or in your chats, basically annoying you with questions like "Is Search healthy?" In general, I have an idea about what SRE is doing, what Search SRE is doing. Let me tell you what I think you are doing, and then you tell me whether I'm wrong or not. Does that sound good to you?

Yeah.

Cool. So basically, what I think is that you are both the Gandalfs of Search, and you are using white magic to keep Search up and running by employing your minions and magic. Is that accurate at all?

I think that's a bit of a stretch. Whenever anyone says "magic" to me, it probably frightens site reliability engineers, because we lean on understanding how things work, and when there's a bit of the system that's magic, you can't handle anything, and that's bad.

Okay, can you define it for me? What is your magic?

I'll have a go, and then I'll leave it to Ben to follow up. The main focus is: we're software engineers, just like the folk
developing features in Search, but our focus day to day is working out how we can make web search that bit more reliable, that bit safer. So there's a lot of work on thinking about how to stop things going wrong. Ideally, the job is that you don't have any issues at all and it all works smoothly because of the work you do. But of course, the visible bit internally is often when there are issues: we're usually the first people on the line to try and work out what's going wrong, with a very clear set of playbooks and tools that we can use to mitigate the problem.

Right, yeah. And for me, a core aspect of how we perform our role well is that we have to understand, at a low level, how things work and how they fit together, so that we can see how things won't work anymore when we make changes, and de-risk that early. So we do try to be very proactive and forward-looking, engaging with as many of the large changes as we can. But obviously we can't be everywhere. We're a very small group of people relative to the number of people shipping code at Google. So things do sometimes break, and then we're there to unbreak them.

But basically, what you are trying to do is keep Google Search, and I guess all its features, up 24/7, right?

Sort of. There's a saying in SRE that if you aim for 100% reliability, you're never going to do it; you're always going to have issues. So one of the things we look at is how reliable we need the service to be, and that varies from product to product. Search is obviously a very high-profile service, so we hold ourselves to a very high standard. But that means we need to do more work, and sometimes it can slow down development because we have to be cautious rolling things out. So we have to choose exactly what level of reliability we are aiming for.

So, for example, Gmail or Google Maps might have a different SLO, what we call a service level objective, than, let's say, Google Search, right?

Yeah,
exactly. And you can think about this from a business perspective: an additional nine can cost an awful lot of money in software engineering, human time, system time, and resources to make that go. So we need to make smart trade-offs based on what our users need and what's best for Google.

That is absolutely frightening to me. How do you get into this line of work? What motivates you to become an SRE?

I didn't ever imagine myself here. I've just always been a tinkerer. Computer science background, I love programming, but I love understanding how things work, how to improve them and make changes safely, and thinking about scale. And I love debugging other people's engineering.

For me, there were two things that got me into SRE. One of them is that I'm a bit of a graph nerd, so looking at monitoring graphs and seeing that things are going in the right direction is something I spend too much time doing. The other was that when I joined Google, I didn't join a site reliability group. I was an engineer in a different area, and the very first non-trivial bit of code that I submitted to web search almost caused... caused an outage.

Oh, high five!

So I had a front-line view of "oh my, I've broken the Google" sort of thing, and I had a front-line seat to seeing how coolly and calmly people handled it, and how they had tools to make sure it wasn't bad. From then on I was watching and thinking, "Yeah, that's a cool team to work for," and I was really happy when I managed to move into working in it.

I can weirdly relate to that. Back in 2012, I was working as an engineer on our indexing systems, what we externally call Caffeine, and I submitted a change that broke news indexing. It was a Friday change, just like it's written in the big book. I got a call. I was in Austria for a private trip, like a
weekend trip, and I got a call from one of the VPs back then, which was not very nice, and they had to roll back my change. How to say it: it was incredibly stressful. Which is my next point: for you folks, it must be incredibly stressful to keep all the systems up so we can answer billions of queries each day. Or do you get used to it? Or is that not even the case?

I'll share a few thoughts, and then maybe David can correct me or augment; we'll see which way he goes. Part of me thinks that if you're not a little bit terrified of being on call for Google Search, you're probably too numb at that point. You need to be on your toes and stay sharp, because it's changing all the time and you can't rest on your laurels. But there is a level of acceptance that you get to, where you understand it can be stressful but you know that you have a team around you. You're the captain of the ship when there's something going wrong. You can be the most junior person on the team and you can get directors to go get resources for you and help fix problems. It's very, very powerful that way. So you do come to accept that. But if you're not still feeling that a little bit over time, that's a worrying signal for me.

Yeah, I agree with Ben. But there is the point that he mentioned being on call. The majority of your time, you're working on normal project stuff, so it's "What are you going to achieve this week?", not "What might happen in the next minute?" When you're on call, and it's maybe one sixth of your time or something like that, there is a little bit of stress there, but outside of that there isn't. And the thing that makes it so much easier for me is knowing that if there is a big issue, people always appear, people always help out, volunteer to
help. So it does feel like a team sport when the big issues arise.

Yeah, I see that on the Search SRE chat: as soon as some bigger thing is happening, two or three people immediately show up, and they are there just to back the person who's on call. That takes away some of the stress as well, because you can rely on someone else's knowledge about systems and how to debug stuff.

It's actually one of the neatest aspects, I think, of working at this scale: no human can fit all the required knowledge in their head, so you do have to depend on your team very heavily. And it has developed a great culture where everyone is willing to help all the time, which is pretty neat.

But what happens if you, let's say, press the wrong button and, I don't know, erase a data center? What happens with your managers? Will you get fired, or get a pay cut or something?

You might actually get a bonus and get paid more. We've had examples of this. In general, the way we try to think about it is: if it's possible to type the wrong command and bring down a major service, then there's something wrong with the system and the processes we have in place. So if you managed to do it, you found a problem in our system, which we can then fix. And genuinely, we've had a few cases where somebody has done something which has been the trigger for, sometimes, a major incident, and because they handled it well, they got everyone complimenting them for how they handled what happened next. It can be a stressful time. But if somebody makes a genuine mistake and it causes a problem, then that's something we can fix.

Yeah, that's pretty awesome. But I'm trying to imagine what you are doing every day, because for me, again, and you're not going to change my mind, you are just
basically Gandalf, who knows everything and can fix everything. What does your day look like from an actual workday perspective? Are you sitting in front of the computer waiting for an incident to happen, or are you, I don't know, writing scripts?

I'll answer in two ways. When I'm not on call, so not the person who's going to get the first alert when something happens, I'm just doing project work: working on that design doc, making code changes, and rolling them out. So it's a different type of work, but very much the same cadence as for a normal software developer. When I'm on call, I try to do the same thing, with an acceptance that I might get interrupted and my day is gone because something major has happened. But you're never just twiddling your fingers and waiting for a big explosion.

When we have new people join the team, I try to set their expectations: you've got your project time, and you should isolate that. When you're not on call, when you're not handling interrupts and the stuff on the front end of the pager, focus on your project work and get that done. When you're on call, if the pager is quiet, there are other interrupts, there are small personal things that you could drive and advance. But really try to separate those two buckets of time, so that you're not disappointed if the pager goes off and it's interrupted your project work. That is your time: go find that weird graph and get nerd-sniped into digging into the next big problem that'll save us millions of failed queries for users, or something like that. And you do see that happen. People will use that time and they'll turn up the next interesting thing, and now we've got an awesome project for people to work on for some period of time. It's not always that way. We do plan projects; they don't
just spawn themselves from graphs all the time. But having that mindset, that you've got interrupt time and you've got project time, is a useful separation.

But is the firefighting part still the core of the work, or is it more towards dev work?

I think we skew more towards project work than interrupts and firefighting. There are periods in time where you feel like you're doing more firefighting than you would like, but maybe 30% of our time is that. David?

Yeah, that feels about right. There is the point that when you are getting those interrupts and you're potentially getting alerted, it is harder to switch back to project work. So even when you're on call, if I look back at a day, there might have been only about an hour when I was actually responding to an incident, but the rest of the day went because I had to context-switch a few times. But it definitely skews much more towards project work.

Okay. We mentioned on call a few times already. Can we define what it means to be on call? Because probably most people are not on call in general.

Yeah. So we have one person who is the primary responder for one part of the system. They're the person who, if our monitoring notices a problem, will get an alert, making their phone beep, and the expectation for us is that you will respond to that within a couple of minutes. So you have to be ready and at your desk, that sort of thing. But with the understanding that this is stressful, you do this for maybe three or four days, and then you hand over to somebody else, and they're the on-caller. And then of course we do that with two sites so we can cover the full 24 hours. For us, there's a site in Dublin and a site in California.

You mentioned that phone beeping. That's what we would have called, a few years ago, a pager.

We've got multiple. We're SREs, and we always have
one more way than we actually need to page ourselves. We've got an app for that, we've got paging, and you'll get a text message, and you'll get a telephone call.

A telephone call? Wow. Is it annoying? Or is it supposed to be annoying?

Well, it's supposed to get your attention. For the most part, the app is what gets my attention quickly enough, and I don't ever see the text or get the phone call because I've already acknowledged the page.

And the annoying bit, actually, was a thing that I learned: I used to have the same ringtone for when my phone goes off as for when I get an alert, and I suddenly found I was getting stressed when my wife called me or something like that. So, change it, so you've got a different tone for an alert than for your wife.

You were basically conditioned. A Pavlovian response. And then, let's say that the pager goes off and you are in firefighting mode. Is that basically running scripts, watching dashboards with beautiful graphs, and writing shell commands? I can't actually imagine what you are doing.

I think the first thing you want to do is get a gut check on what the actual impact of this is. Is it big? Is it small? Do I need more help immediately? So the first few minutes go on initial triage, to figure out whether this is a real thing or a false alarm, that kind of thing. That's my first minute or two. David, I don't know how you approach it.

Yeah, so trying to work out what's happening and why you've been alerted is the first thing, I think, and really understanding that. And then you do move on, hopefully fairly quickly, to "How can I mitigate it?" "How can I stop the bleeding?" is the phrase we often use. Maybe a few years ago that was shell-script based, but nowadays we've tried to get it to a few fairly standard mitigations which most of the
time will work. We have tooling so it's a lot easier to do. We notice this change has just gone out: can we roll that back so we're in the state that we were in 10 minutes ago? There's a button for that, which makes it a little bit easier, so you don't have to write a shell script while you're all stressed and maybe get that wrong.

Yeah, that makes sense.

People do still do that, but I think it's typically after we've got the mitigation in place, when we're trying to expedite changes. Scripting and querying is still used, but definitely not on the front line, and it certainly has increased over the years.

I'm trying to imagine an incident in my head. Something goes wrong. I imagine that there are several levels to an incident, because from my experience working with you folks, some incidents go under the radar even internally, so they wouldn't show up in my inbox, but then some incidents are extremely visible externally, for example when something happens with news indexing or fresh indexing. Am I right that there are multiple levels to these incidents internally as well? Or are you just really good at hiding them?

Well, ideally even the biggest ones are still hard for most people to spot. But yeah, we do try to classify based on user impact or revenue impact or different severity dimensions. Again, that can affect how you respond to something. If something has negligible impact, you can take your time and debug it a little more deeply. If it's a huge impact, it's mitigation, mitigation, mitigation: you've got to figure out how to stop that bleeding.

And perhaps a very ignorant or even stupid question, but how do we know that there's an incident? Is it that we get lots of reports on social media and then one of the SREs spots that, or do we have tools that go off? What does that look like?

So in general, we aim to
make sure that we're the first people who notice it. With all due respect, Gary, when you pop up and say people are complaining that there's a problem with Search, we think, "Oh no, we've got it wrong." If you pop up 30 minutes into our debugging and we say, "Yes, we know, we're working on it," then that's kind of a success, for the monitoring side anyway. So we really focus on that, and we look at stats for how many of the incidents, small to large, we noticed first, and how many a user reported to us. If a user reported it, then we often conclude there's a gap in our monitoring, and ask how we can fix that, so the next time something similar happens, we notice it and don't need to wait on users.

Can it be as easy as... Now I'm trying to think back to the pre-Google period of the common Gary. I was managing servers for a hosting company, and one thing that I was looking at obsessively was the error rate, the HTTP error rate in the front ends, or the servers that are serving front ends. Can it be that simple at Google's scale as well, or is it way more nuanced?

So, yes, and... We definitely still care about HTTP error rates and things like that. But one of the really cool things in Search in my time here has been the evolution of thinking, where yes, we still have that foundational level of care for what you're getting as a response, but we're actually thinking in a much more nuanced and fine-grained way: are you getting the right product experience right now? And that really requires understanding not just "Are we shipping something to you and giving you a 200 response?" but "Is what we ship to you correct and working correctly?" We've really pushed the envelope on that over the last, I'm going to say, five years.

To me, that's just mind-blowing. What do you think about going through an actual incident that happened, and seeing how it appeared
on your end? Would that work for you?

Sounds like fun.

Sure.

Do you have an incident in mind, by any chance? Because if I pick, that would be painful.

One of my favorites (I think an SRE is allowed to have a favorite incident) was during the football, soccer, World Cup in 2022, so a year and a half ago. We had issues where, when some of the matches were on, we got alerted. It was one of these failures which was a success failure, to a certain extent, in that we suddenly got way more traffic than we were expecting. My mental model before this was: if there's a match on, you watch the TV, watch the match. Turns out people also search, especially when there's a goal. They search for who scored and for information about the scorer. So we were seeing these massive spikes of traffic whenever anyone scored. That's the one that sticks out for me.

It certainly sticks in a lot of people's memories. That was, I think, maybe one of our best uses of our IMAG training, our Incident Management at Google. We put all those best practices into play when that happened, and a lot of people contributed to making that go okay.

That sounds cool. Then let's talk about that. Because it's the World Cup, I imagine that we do some extra provisioning for those times. When we know that there's something big happening, we add more resources, I imagine, or more machines, or something like that.

Yeah. It goes back to what we were talking about at the beginning: if we got this right, then we'd have done all the work six months in advance, predicted how much traffic we were going to get and how expensive that traffic is to serve, and made sure that we had planned it well in advance so we had the capacity to serve it.

When you say expensive, are you thinking of dollar expensive, or machine-resource expensive?

I generally think of, well, CPU expensive. So how expensive it is
for a machine to handle, because not all requests are the same. A simple query, where we've had exactly the same one before, maybe we'll be able to serve out of a cache, and it'll be super cheap from a computing perspective. But if we get it wrong, then it gets more and more expensive. That was one of the issues we faced in this incident: it was surprisingly CPU intensive to serve most of these queries.

But I imagine that you also do load testing before releasing these special features for stuff like the World Cup. That might also reveal stuff that would otherwise go unnoticed, I imagine.

Yeah, there was an awful lot of planning that went into this: projection of the expected usage, load testing, and cost profiling. But it turns out that cost profiling before the real event is not as easy as we would like it to be.

Do we also increase staffing? Do we get more SREs in a room for that time, lock them in, and say, "Now you just watch these graphs until the World Cup is over"?

No. For staffing, it was: we've done the pre-work, it should just work. So we will have one person who will get paged if there's a problem, and we all rallied around when it became clear that there was a larger problem. In general, we try to say that as long as we've done the pre-work, we don't need to have people just staring at graphs all the time.

And then how did we notice that there was something going wrong?

This was a great case where our automated alerting caught it and gave us early warning. In particular, the thing that it warned us about was errors, but they weren't errors that were particularly obvious to users. We had so much traffic that we were basically at our limits for what we could serve, but we have processes in place where we will drop the lower-priority traffic. So, for example, somebody internal at Google is
running some sort of load test which loads our systems: that's the first thing that will just get dropped on the floor if we have an issue. And then there are some other lower-priority services where, if it's a little bit flaky, if it fails a couple of percent of the time, no one will really notice, and those were the next ones to go. But it's at that level that we got alerted that things were getting bad and could get worse if we didn't deal with it.

Meaning that the integrity of Search is at risk? Would you say that a feature like the World Cup OneBox could affect Search as a whole?

Sure, yeah. If it had got much, much worse, then people would have searched for the score and just got an error saying, "Sorry, our engineers are on it." It didn't get to that level, but that's the thing that you worry about when you get these alerts.

That sounds insane. Literally, I got stressed just listening to you. So what did we do to fix the issue? Because it is a very meta issue in my head: queries are becoming increasingly expensive, and in my brain that just means that we throw more machines or more CPU or more RAM in the pot and then let it be expensive. But apparently that's not the case.

Well, that's partly the case. Where we're able to upscale things, we would, but we would also look to reduce the costs, and we would look to change how we're managing the traffic. It is a significant challenge, as you say. So there was no one single solution in that case.

Yeah, and the thing that made it a lot of work was that we do try to have systems which will throw more machines at the problem when we start to notice we're full, but this was such an extreme spike that you hit a limit at some point. And one of the things that we noticed is that we saw
this about halfway through the tournament. We saw that we were hitting these massive spikes of traffic and struggling to serve them, and we were pretty confident that when the World Cup final came around, that was going to be a bigger match with bigger spikes. So we had a deadline, I think it was about two weeks, before we knew the biggest game was happening. We had some time, but not a lot of time, to put in place a few longer-term mitigations to make sure that we could serve things smoothly.

And then, if you think about Search, Search is not a monolithic service. It's not just one service running, but probably hundreds if not thousands of services running together, and then those services being orchestrated to serve users' queries. When you say that you add resources, you have to find the actual service that is starving, right? I'm trying to imagine how I would go about finding which service is starving, where to add resources, and in my head that just seems impossible, because we have so many smaller services running. I imagine you have graphs for that?

Yeah. In this case it was fairly straightforward. The alerting gave us a very direct signal as to where to look for issues and things like that. Over time there have been more esoteric issues, but they tend not to be at a scale as significant as what we were seeing during the World Cup.

Okay. Let's see how else I would fix issues. Google has lots of data centers. I know from the SRE chat that sometimes data centers are taken offline for service or whatever. Maybe I could add back one of those data centers to help alleviate stress on other data centers. Is that a possibility?

Yeah, there were definitely things that we did around moving resources around and making sure we were using all the resources that we had available. And it was actually kind of nice to see. As Ben said, it was
fairly obvious which part of the system was under the most stress, so throw resources at that. And some of the systems were actually totally fine, so we could steal resources from one to give to the other. Strangely enough, the subcomponent which just does sports was humming along pretty much fine, because it knew it needed to serve basic information, like "this is the score", and it had caching set up, so it could serve a huge amount of traffic for that. So it wasn't that one. It was one of the other large components. And it's often that way: the one you focus on, you get right beforehand, and it's the one next to it that causes problems.

I think all this chat that we had just reinforces for me that I don't really want to be an SRE.

It's a fantastically interesting role, though. I'm going to put a little pitch in here.

I agree with that. Every day is different. You're solving puzzles.

Well, unfortunately, my day-to-day work is also very diverse. We mentioned mitigation a few times. How is that different from a fix, from an internal perspective?

So the way I see it, a mitigation is something very short-term to make sure that we are in a vaguely healthy state, but it's not a long-term fix. You can do things like, as I mentioned, roll back to the state of the system, say, half an hour ago. But you can't just leave it there. You have to work to understand what the actual problem was and do the underlying fix before things start rolling forward again. You do the mitigations, and then one of the big things that we do after an incident, and we did with this one, is write a postmortem: this is what happened in detail, these are all the things that went really well, these are all the things that went really badly, and these are all the things that we can fix, so next time there's a big sports event, it happens without any SRE knowing or caring.
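
As an aside to the mitigation-versus-fix distinction described here, "roll back to the state of the system, say, half an hour ago" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Release` record, the `pick_rollback_target` helper, and the version labels are hypothetical names invented for the example, and nothing here reflects Google's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Release:
    """Hypothetical record of one rollout: a version label and when it went live."""
    version: str
    deployed_at: datetime


def pick_rollback_target(
    history: Sequence[Release], now: datetime, lookback: timedelta
) -> Optional[Release]:
    """Return the release that was live `lookback` ago, or None if none was."""
    cutoff = now - lookback
    # Only releases that had already gone live at the cutoff are candidates.
    live_at_cutoff = [r for r in history if r.deployed_at <= cutoff]
    if not live_at_cutoff:
        return None
    # The most recent of those is what was actually serving at the cutoff.
    return max(live_at_cutoff, key=lambda r: r.deployed_at)


now = datetime(2022, 12, 1, 12, 0)
history = [
    Release("v41", now - timedelta(hours=6)),
    Release("v42", now - timedelta(minutes=50)),
    Release("v43", now - timedelta(minutes=10)),  # the suspect change
]
target = pick_rollback_target(history, now, timedelta(minutes=30))
print(target.version if target else "nothing to roll back to")
```

In this sketch the rollback only chooses a known-good state to return to. It says nothing about why `v43` misbehaved, which is the part the speakers call the fix.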

Concretely: recently someone on my team got paged. They saw that this was an issue affecting a single data center. The response was to take that data center offline. It stopped serving, and users stopped noticing any potential impact. So that's the mitigation. The fix is when we identified the root cause of why that data center wasn't working, restored it to fully functioning order, and could put it back in service.

Right, because users are routed to a different data center if the route to whichever data center was taken offline is broken, right?

Yeah.

Back to the World Cup thingy. Was this actually noticed externally? Did we get complaints from actual users?

We didn't get any obvious complaints, because we caught it. There were some errors, but you would have to be watching the HTTP requests going back and forth to actually notice them. And the work we did meant that during the World Cup final it was actually nice and quiet and things went smoothly. So that's one of the reasons I like it: it had a happy ending.

I guess Sundar tweeted, I think, that we set traffic records during that event.

Oh yeah. I actually have a screenshot of the tweet here, and Sundar said Search recorded its highest ever traffic in 25 years during the final of the FIFA World Cup.

Some background questions. Let's say that I'm fresh out of school and I've decided that I want to become specifically a Search SRE. How do I go about it? Do you have any tips or tricks for becoming a Search SRE?

Do you want to take a stab at that, David?

Yeah. The first thing to say is: focus on the engineering side, because it is an engineering role. In terms of what you need to know, there's not that much difference between a developer and an SRE. But the thing I would focus on, on top of that, is: are you the sort of person who likes troubleshooting, who likes seeing that something's broken and understanding why it's broken and how to fix it? And the advantage is that computers
always break, so there are always plenty of cases that you can find to ask "What's going on here?" and really drill down. If you actually enjoy that type of work, then yeah, SRE might be a good role for you. So try a few of those to get a feel. That would be my view. Ben?

Yeah, so plus one. An engineering mindset, debugging, tinkering, playing with things and systems is always going to get you on the right path there, I think. I would note that we have a very diverse group of people who work in SRE, in their backgrounds and where they're from. My mentor when I started was a political science major, and sort of learned to be an engineer because he liked to tinker and play and things like that. So you don't need a traditional computer science background either. It's capabilities and knowledge, yes, but mindset will get you a long way too.

How deep a knowledge do you have to have? Do you have to be able to notice that some random bit was flipped by a cosmic ray, or is that way too low-level for your roles?

We do debug right down into hardware issues and CPUs at times. Not everyone is able to go that deep, but kernel issues, network issues, hardware issues: we do debug down to that level.

Okay, and now I'm even more scared of you.

Yeah, but the flip side is that one of the things I think SRE has is that we often have more breadth to what we look at. As a Search SRE, you end up looking at quite a few bits of the web search stack. If you're a developer, then you get a little bit more depth, and you get to be the expert on one part of the system. So there is the breadth type of thing, which means you have to accept that you're not going to be the expert all the time, and you will hand off to the kernel expert who understands this. So you don't need to be an expert in all these things to be an
SRE. Actually, the soft skills around communication and collaboration are probably way more important for an SRE than...

Yeah, right, than what your Linux skills are. So yeah, plus one.

Awesome. Thank you very much for joining me here today for this chat. It was frightening, and also very eye-opening.

Well, yeah, and thank you from me as well. It's nice to get some publicity in a good way.

Yeah, likewise. Really enjoyed this, Gary. Thank you.

We've been having fun with these podcast episodes. I hope you, the listener, have found them both entertaining and insightful too. Feel free to drop us a note on Twitter at @googlesearchc, or chat with us at one of the next events we go to, if you have any thoughts. And of course, don't forget to like and subscribe. Thank you, and goodbye.

[Music]
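
A closing aside on the load shedding described in the World Cup discussion, where internal load tests are dropped first, then services that tolerate a small failure rate, and user queries last. A minimal Python sketch of that idea follows. The three priority classes, the `shed_load` helper, and the request names are assumptions made up for illustration, and a fixed per-tick capacity is a simplification; the episode does not describe how the real system implements this.

```python
from enum import IntEnum
from typing import Dict, List, Tuple


class Priority(IntEnum):
    """Hypothetical traffic classes; a higher value is more important."""
    LOAD_TEST = 0    # internal load tests: dropped first
    BEST_EFFORT = 1  # services that tolerate a few percent of failures
    USER = 2         # user-facing queries: dropped last


Request = Tuple[str, Priority]


def shed_load(
    requests: List[Request], capacity: int
) -> Tuple[List[Request], Dict[Priority, int]]:
    """Admit up to `capacity` requests, most important first.

    Returns the admitted requests and a per-priority count of dropped ones.
    Python's sort is stable, so arrival order is kept within each class.
    """
    ranked = sorted(requests, key=lambda r: -r[1])
    limit = max(capacity, 0)
    admitted = ranked[:limit]
    dropped = {p: 0 for p in Priority}
    for _, prio in ranked[limit:]:
        dropped[prio] += 1
    return admitted, dropped


incoming = [
    ("loadtest-1", Priority.LOAD_TEST),
    ("q-1", Priority.USER),
    ("suggest-1", Priority.BEST_EFFORT),
    ("q-2", Priority.USER),
    ("loadtest-2", Priority.LOAD_TEST),
    ("q-3", Priority.USER),
]
admitted, dropped = shed_load(incoming, capacity=4)
print([name for name, _ in admitted])
print({p.name: n for p, n in dropped.items()})
```

With a capacity of four, only the two load-test requests are dropped. An alert keyed on the `BEST_EFFORT` drop count rising above zero would roughly correspond to the point at which the speakers say they were paged: before any user query is affected, but close to it.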