Tech Overflow

When the Internet Breaks: Bugs and Outages

Hannah Clayton-Langton and Hugh Williams Season 1 Episode 4


Catastrophic software failures can seem like acts of chaos, but behind every major tech outage lies a story of human decisions, technical constraints, and cascading consequences. The July 2024 CrowdStrike incident—which Hannah describes as "the single biggest outage in the history of computing"—offers a perfect case study into what happens when critical systems fail.

Hannah and Hugh dive deep into how a seemingly minor error (a file with 21 fields when the software expected 20) managed to crash millions of Windows computers worldwide, grounding flights, shutting down hospitals, and causing billions in economic damage. Hugh walks us through the technical underpinnings of why this particular failure was so devastating—CrowdStrike's Falcon security software runs deeply embedded within Windows, making a simple mismatch catastrophic rather than merely inconvenient.

The conversation explores the safeguards that many companies use that could have prevented this disaster: progressive rollouts, chaos engineering (Netflix's deliberately disruptive "Chaos Monkeys"), and fuzz testing that generates random inputs to break systems before they reach production. Hugh shares war stories from his own career, including a nine-hour eBay search outage that cost millions and a Google Maps bug that inadvertently became an international incident when labels disappeared from politically sensitive regions.

What's particularly fascinating is the cultural side of managing technical risk. The most resilient organizations have moved beyond blame to create environments where finding bugs is celebrated rather than punished. Hugh and Hannah discuss how former military personnel often excel in operations roles during crises, bringing calm structure to chaotic situations, and why the best tech companies are working toward systems so resilient that engineers being woken up at night is becoming unnecessary.

Whether you're part of a tech or tech-enabled company or simply curious about the infrastructure powering our lives, this episode reveals the balance between innovation speed and operational stability that every technology organisation must navigate. How do you move fast without breaking things? How do you recover when systems inevitably fail? And what separates organisations that learn from failure from those doomed to repeat it?

If you’ve enjoyed this episode, please like, subscribe, or follow Tech Overflow and share it with your friends and colleagues.

Hannah Clayton-Langton:

Hello world and welcome to the Tech Overflow podcast, where we have smart tech conversations for smart people. I'm Hannah Clayton-Langton and, after nearly seven years working in tech companies, I decided that I wanted to understand a little bit more about what was going on around me, so I enlisted my co-host, Hugh, to take me on a technical journey. Hugh, how are you? How's the jet lag?

Hugh Williams:

I am well, Hannah. It's day six, so it usually takes me nine days to get over the jet lag, but it's fabulous to be here in London with you, actually in the same room, without a 200 millisecond time delay. And for those of you who don't know me, Hugh Williams is my name. I'm a former vice president at Google, I was a vice president also at eBay and I was a senior engineer at Microsoft, so my job here today is to help demystify tech topics for Hannah.

Hannah Clayton-Langton:

Perfect, okay. So I'm super excited for today's episode because it feels like a proper under the hood topic, which is what we're calling bugs and outages. So what happens when it goes wrong and what sort of systems are in place so that things don't go wrong? Because, as end users, we definitely notice when things don't work, but we expect them to work all the time.

Hugh Williams:

I can already feel, in my stomach, that sort of stressful feeling that I used to get when I was a vice president: an outage or a bug and the CEO calling me. But I'm looking forward to the conversation.

Hannah Clayton-Langton:

Well, we'll try not to stress you out too much. So let's start with probably the most notable outage that I reckon most of the listeners will have been aware of. It was about 11 months ago, I think, which was the CrowdStrike outage. So maybe, before we get into the sort of nuts and bolts of bug and outage management, can you just talk us through what happened there? Because I think that's a really good real-life example of when tech goes wrong.

Hugh Williams:

I remember when it happened and as an engineer, I think we're a bit like a brotherhood or sisterhood at some level. We always feel for each other and I remember thinking somebody's having a really, really bad day and gee, I'm glad that's not me. We'll probably talk about some of my stories later on, of things that have gone wrong, but I definitely felt for the engineering team over there at CrowdStrike, so let's pull it apart, it'll be fun.

Hannah Clayton-Langton:

Yeah, so when an outage happens, that's basically like an engineering fault or mistake, right, like my understanding is, it's probably someone's deployed something new like an update to the code and there's been an unintended consequence of what they rolled out and it breaks something. Is that like a fair, generic assessment?

Hugh Williams:

I think that's fair. I think one thing to remember, though, is it's not always the fault of the folks that you think it is, right? So let's imagine that one of your favorite websites goes down tomorrow. Could indeed be those folks. So it could indeed be the folks that are, you know, building Instagram or whatever it is that you're using. It could also be the folks that are hosting the service that that runs on. So you know, let's imagine that Instagram runs on AWS. That's built by Amazon. It could be an AWS outage that's causing it. So not the fault of the folks at Instagram.

Hugh Williams:

Right, and then there's all sorts of internet infrastructure in the way, right. So, for example, there's these things called DNS servers. We'll talk about that some other time, but that's how your computer, when you're using your web browser, figures out exactly where the machines are that it needs to talk to. So you're used to typing in an English thing like Instagram.com and pressing enter. That gets turned into some numbers behind the scenes, and there's these things called DNS servers that do that conversion. So if the DNS server is unavailable, your browser can't turn the words into numbers, and then you'll think that Instagram's down, but it might in that case be absolutely nothing to do with the folks at Instagram. But definitely, yes, you're right. I mean, coming back to your first point, a lot of the outages are caused by folks making mistakes who are actually building the products.
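For readers who like to see the idea in code, here's a minimal sketch of that name-to-number lookup using Python's standard library. It's purely illustrative: a real browser talks to a full DNS resolver with caching, retries and fallbacks, and the instagram.com lookup here is just an example.

```python
# Minimal sketch: turn a hostname into an IP address, the job DNS does for your browser.
import socket

def resolve(hostname: str) -> None:
    try:
        # Ask the operating system's resolver to turn the name into an address.
        ip_address = socket.gethostbyname(hostname)
        print(f"{hostname} resolves to {ip_address}")
    except socket.gaierror as error:
        # If resolution fails (for example, DNS is unreachable), the site itself may be
        # perfectly healthy, but your browser has no idea where to send the request.
        print(f"Could not resolve {hostname}: {error}")

resolve("instagram.com")
```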

Hannah Clayton-Langton:

Yeah, well, that's what's kind of terrifying, right, because I make mistakes all the time and I don't think I've ever quite had as bad a day as whatever that engineer at CrowdStrike did. From a technical perspective, like, is it clear now what actually happened?

Hugh Williams:

I think it's reasonably clear. Like a lot of these outages and bugs, a whole bunch of things went wrong. So let's start with what actually was the thing that broke. So CrowdStrike has a product called Falcon, and Falcon is a product that's installed on users' computers mostly. It can also be installed on machines in a data center that are running software that's critical to an organization, and what Falcon does is it basically inspects the internals of the computer to see if nefarious, malicious things, patterns of those things, are starting to happen.

Hannah Clayton-Langton:

Because CrowdStrike is, at its core, like a security software and security service provider? Okay, yeah.

Hugh Williams:

And this is one of the most popular products. So you can imagine if we worked at a large company, a company that uses a lot of tech, a tech-enabled kind of company, we might say. In our security department we should get CrowdStrike Falcon installed on all of our computers and it will monitor behaviors that are going on on all of our computers and then if it detects something that has the potential to be malicious, it'll take some action right.

Hannah Clayton-Langton:

Okay, so this is like super important fundamental software that I'm guessing a whole bunch of companies are using, based on the variety of companies that went down on that day, so it's pretty commonly used, right.

Hugh Williams:

Very popular software and you can imagine us making the decision that we want all of our end users in our business to have this software right, so that if they click on a malicious website or they open an email they shouldn't open or they try to install some software they shouldn't install, that this system's sitting there deep inside their Windows machine making sure that there's some extra protection there.

Hugh Williams:

So the first thing to know about this Falcon software and Microsoft Windows is that the Falcon software runs, really, I guess in layperson's terms, as part of Windows. It needs to run really deep inside the machine because it's got to inspect the whole machine, so it's looking for all sorts of behaviors that might be occurring within the machine. So it's not like something like Microsoft Word that you install on top of the operating system, which means it runs in a very safe kind of way. This is actually something that's running deep inside Windows and actually has a lot of control over what's going on inside the computer. So that makes it very, very dangerous. So if something goes wrong in Falcon, something is likely going to go wrong inside Windows, and in this case you ended up with the blue screen of death, right.

Hannah Clayton-Langton:

Okay, so sorry not working. End of story.

Hugh Williams:

End of story. A file was installed deep inside this Falcon software, and this file didn't have the contents that the Falcon software expected. The CrowdStrike folks have deployed this file onto all of the computers in the world that run this software. The Falcon software has opened up the file. It's expected the file to have certain contents. It didn't have those contents, and so the Falcon systems actually crashed.

Hannah Clayton-Langton:

Okay, but that file's contents, it wasn't malicious content, it was just like literally different to what the computer was expecting, and that sort of caused like a fault.

Hugh Williams:

I know many of our listeners will be familiar with things like Microsoft Excel or Google Sheets, or maybe even comma-delimited files. I think lots of folks import comma-delimited files into Excel and Sheets. And so the kind of thing that happened here was that there was a file, and this file had a certain number of fields, which I think was 21, but the software was only expecting 20. It hadn't been updated to expect the full 21 fields. And so it opened up the file, it found it had 21 fields, the software was expecting 20, and all sorts of bad things started to happen.
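To make that concrete, here's a simplified sketch of the kind of mismatch described above: a parser written to expect exactly 20 comma-delimited fields is handed a line with 21. This is not CrowdStrike's actual file format or code, just the general failure mode, and the field names and functions are made up. The difference between the two parsers below is the difference between an error message and, for software running inside the operating system, a crashed machine.

```python
# Illustrative only: a happy-path parser versus a defensive one.
EXPECTED_FIELDS = 20

def parse_record_strict(line: str) -> list[str]:
    fields = line.split(",")
    # Happy-path assumption: anything unexpected blows up.
    assert len(fields) == EXPECTED_FIELDS, f"expected {EXPECTED_FIELDS}, got {len(fields)}"
    return fields

def parse_record_defensive(line: str) -> list[str] | None:
    fields = line.split(",")
    if len(fields) != EXPECTED_FIELDS:
        # Reject the bad update and let the caller carry on running.
        return None
    return fields

update = ",".join(f"field{i}" for i in range(21))  # 21 fields: one too many
print(parse_record_defensive(update))  # None: rejected gracefully
parse_record_strict(update)            # AssertionError: the "crash"
```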

Hannah Clayton-Langton:

And so because it's so deeply embedded, it wasn't just like "error, please restart" or "error, couldn't read file type". You ended up sort of tripping the whole system, and blue screen of death means you can't use your computer, right?

Hugh Williams:

That's right, exactly. Because if something like Word had a problem like this, Word would crash.

Hannah Clayton-Langton:

Yeah.

Hugh Williams:

And you'd say, huh, Word's crashed. I'll try starting it again. Huh, it keeps crashing. Maybe I'll try downloading a new version, or I'll wait till tomorrow, till Microsoft updates it. But because this Falcon software runs deep inside the operating system, this error actually took down the operating system, and so all these blue screens of death started happening. So the CrowdStrike folks deploy this file and they basically shut down every Windows machine that this software is installed on. They all get the blue screen of death. Of course, what happens after the blue screen of death is a lot of folks will try and reboot the machine. So they say, oh, reboot. But the problem was, when it booted back up, the same thing happened again.

Hannah Clayton-Langton:

And every Windows system across the world that had Falcon installed basically went black or went blue.

Hugh Williams:

And was unusable and would not boot up again.

Hannah Clayton-Langton:

And so, in case any listeners don't know exactly the incident we're talking about, I remember it because it was like British Airways went down, like healthcare systems were going down and people were really in crisis. Like, the first thing I remember is my WhatsApp group chat sliding up and everyone saying it was this sort of huge international hack. Obviously it wasn't a hack, but that's sort of where you end up, in a panic state, when everything starts going down around you, right?

Hugh Williams:

Yeah, one of my good friends always says you have to choose between conspiracy and incompetence. Pick incompetence every time.

Hannah Clayton-Langton:

But conspiracy is much more interesting.

Hugh Williams:

Conspiracy is much more interesting.

Hannah Clayton-Langton:

Okay, and so I think I also read two facts that I found interesting on this one: 5% of all flights that day didn't go, which is a huge proportion of flights globally if you think about the economic impact, which I'm sure we'll talk about later; and it was the single biggest outage in the history of computing and IT. So a really, really bad day.

Hugh Williams:

I think some countries have even tried to compute how many billions of dollars of damage this caused. I think you'd need a major consulting firm to figure it out because of the cascading issues of all the economic damage that it did, but it's probably economically the most significant outage that there's ever been.

Hannah Clayton-Langton:

Yeah, and, as you say, it was a CrowdStrike error. But of course most of the companies affected by this basically didn't have a plan for if things went down. They sort of trusted that it would work 100% of the time.

Hugh Williams:

Yeah, which is pretty naive, right? So if we were, let's not pick on any particular airline, but if we're a major airline, you know, one of the top companies in the world, hundreds if not thousands of planes in the air, you would expect, I think, your chief information officer, or whoever runs your technology team, to probably have some processes that make you resilient against these kinds of issues, right? Like, it is possible that Windows gets into a state where it continually reboots, and so you would think that they'd have some process where they could, you know, remotely re-image the machine with a safe version from last week or whatever it is, or, you know, there'd be processes in place that could actually get you back into a known state that you could recover from.

Hugh Williams:

But I think just about every company that went down kind of pointed, quite reasonably, I guess, at CrowdStrike and said, what have these folks done to us? Some folks also pointed at Windows and said, hang on a minute, why is this software effectively running as part of Microsoft Windows? Like, really? You're letting that happen? So there was a bit of pointing going on, but I'm not sure quite enough companies looked at themselves and said, hang on, you know, we're responsible for providing this service. Why aren't we resilient enough against these kinds of issues?

Hannah Clayton-Langton:

Yeah, that makes sense. And if we take a step back to bugs more generally for a second, or unintended consequences of a rollout: at work, I often see us talking about rolling back the change. So you deploy something, it doesn't work as expected, and you can do two things. You can fix it really quickly or you can roll back, and sometimes, if it's in a real like panic state, the quickest thing is going to be to roll back, which I think essentially is like hitting undo. Right, you just go roll back the change, get things to how they were before. But it sounds like that wouldn't have worked in this instance?

Hugh Williams:

That's right, and I think, you know, there's different levels, I guess, of how much control you have. So if we're working at a major internet company, let's go back to Instagram, we're deploying our software on machines that we control, so we can be a little bit more free and loose, right, because if we mess something up, these are machines that we control. We can rectify whatever occurred on those machines by rolling back or fixing forward, and we'll talk about that a little bit more, I'm sure, in a moment. But remember, this is a situation where this company is putting an update out there and every Windows machine that's out there that's running this software is effectively sucking that update down onto that machine, and CrowdStrike doesn't have access to the machines.

Hannah Clayton-Langton:

So it's like a one-way street.

Hugh Williams:

It's a one-way street, so you would think in this situation, that you know the bar, if you like, for the quality of the updates and the care that needs to be taken needs to be very, very high, because it's a one-way street.

Hannah Clayton-Langton:

Well, that's my next question, because I did some research on this ahead of this episode and I read something that said they'd only tested the happy path. And I wanted to bring up the happy path for the non-technical listeners because I find it to be quite a neat concept. From what I understand, the happy path is when everything works, so you're basically testing that the code you're shipping performs as expected in a situation where it's encountering everything as it should be: the happy path. And it sort of makes sense to me that you would want to test a few of the less happy paths, because, you know, things happen. And I read that that was one of the sort of diagnoses as to what went wrong: they hadn't tested this code rollout in a situation where everything around it wasn't functioning as expected. Is that right?

Hugh Williams:

That's fair. That's fair. And you know, if I go back to the mid-2000s, when I was at Microsoft, I mean, we had software engineers and we had software engineers in test. So there were two separate disciplines. The software engineers built software and the software engineers in test would try to break software. It's quite different DNA, actually. I think people are born as builders or breakers, and the folks who end up in the breaking half of the house are pretty special people. I remember, you know, being at Microsoft, catching up with one of the software engineers who was in test, and we were going to go down to the cafe and have lunch. True story: this person put a book on top of their keyboard and I'm like, why'd you put a book on top of the keyboard? And they're like, I just want to see what happens.

Hugh Williams:

if, you know, random characters get entered into this form for an hour, and see what breaks. And then off we went. I think it's a special kind of DNA, right, to sort of have that mindset of, I will just do things to try and break things. And so you've sort of got to have two halves of this story, right.

Hugh Williams:

You've got to have people who build, and people who build don't necessarily think as clearly about breaking as the second half, which are the people who break. You know, I grew up as a software engineer who builds things, so I wouldn't call myself an expert in breaking things. But the folks who do the breaking, you know, treat this very much as a discipline. So let's imagine you and I are building a calculator. The folks who are empowered with breaking the calculator are going to do all sorts of funny things to our calculator. So first thing they're going to do is try dividing by zero. So does dividing by zero cause the calculator to crash or does dividing by zero cause it to come up and say, oh, undefined if you divide something by zero? So they're going to do things like that.

Hugh Williams:

You know they're going to type in numbers with a decimal point but no numbers after the decimal point, and see what happens. You know they're going to try multiplying really, really large numbers together that can't be displayed on the calculator and say, well, is this going to? You know what happens when the number's too big. So they're going to think of all the things that are outside the happy path right of just normally using a calculator, and they're going to build software that exercises that path. And then, when the calculator breaks, a well-run company will say awesome, we found a bug, this isn't a bad thing, it's a good thing. And then they'll file the bug in some system and you know that will ultimately get rectified. But these people really are thinking about breaking things.
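Here's a rough sketch of the kind of edge-case tests a breaker might write for that hypothetical calculator, written in Python for pytest. The divide function and its behaviour are assumptions made up for this example; the point is that the tests deliberately step off the happy path.

```python
# Run with pytest. The calculator here is a stand-in; the tests probe the unhappy paths.
import pytest

def divide(a: float, b: float) -> float:
    if b == 0:
        raise ValueError("undefined: division by zero")
    return a / b

def test_divide_by_zero_reports_undefined_instead_of_crashing():
    with pytest.raises(ValueError):
        divide(1, 0)

def test_trailing_decimal_point_input_is_handled():
    # "7." is a number with a decimal point but nothing after it.
    assert divide(float("7."), 2) == 3.5

def test_huge_numbers_do_not_silently_misbehave():
    result = divide(1e308 * 10, 1)  # too big for a float: overflows to infinity
    assert result == float("inf")
```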

Hannah Clayton-Langton:

Okay. And how, in a software company, because you've worked in, obviously, some quite well-known software companies, how long does that testing phase take?

Hugh Williams:

A couple of weeks? Longer. So if you go back to our product management episode, where we talked about waterfall and we talked about the variants of agile: if this were a waterfall company, then there is a testing phase. We're going to cost that out like we're costing out the building. So we're going to say, well, what are all the scenarios that we want to test? If we're running a more agile process, you know, we're running sort of these one or two week or four week sprints, then testing is going to be part of those sprints, right? So we're going to build some features that are part of our product, and then we're going to try and break those features as part of the product, and when that's finished, then we'll say, okay, squash all the bugs, and then the software is ready to go. So it's just going to be part of these very, very short cycles too.

Hannah Clayton-Langton:

Let's talk a bit more about the breakers, like breaking software. Okay, this is sort of a coordinated function set up by a business. Have you ever seen that done in a particularly interesting or useful way?

Hugh Williams:

Yeah, a couple of stories for you, so maybe we could talk about chaos monkeys and chaos engineering, which sounds like it's a fun topic.

Hannah Clayton-Langton:

Chaos monkeys, yeah.

Hugh Williams:

Another thing we could talk about is fuzz testing. Okay, where do you want to start?

Hannah Clayton-Langton:

Chaos monkeys.

Hugh Williams:

Chaos monkeys. So imagine, I want our listeners to kind of close their eyes and just imagine: imagine there's monkeys let loose in a data center full of computers, and these monkeys' job is to run around randomly turning off computers. So imagine that as a concept. That's actually an idea that somebody at Netflix had in the early 2010s, and so the idea was, why don't we build software that randomly turns off computers, and then we'll make sure that our systems are resilient against that happening?

Hugh Williams:

Because if you go and look at a modern data center whether it's a Google data center or a Microsoft data center or an Amazon data center they're made up of very cheap computers. In the old days we used to have very reliable mainframe computers. Today we have very cheap computers that are very unreliable. So somebody at Netflix said, huh, I guess we should build software that's resilient against these machines effectively being turned off. And so they wrote some software that would randomly turn off random computers at random times, and the expectation was that the engineering team built software that was resilient against that. So there was a big AWS outage an Amazon AWS outage. I think it was in 2015.

Hannah Clayton-Langton:

And AWS is the cloud computing provider that probably supports a lot of Netflix?

Hugh Williams:

Yeah, and Netflix runs on top of that. When that outage happened, it was chaos. So all over the globe, lots of these companies are built on top of AWS.

Hannah Clayton-Langton:

And so suddenly, all of your favorite services stop running. Most tech companies right are built on top of AWS.

Hugh Williams:

And guess who didn't go down? Netflix. Because they'd now had five years of history of these chaos monkeys turning off computers, and so, guess what, they were really good when data centers went down. They then had other chaos ideas. So, what happens if we go and fill up disk drives? We'll have a chaos engineering tool that goes and just randomly fills up disk drives and then we'll see what happens.

Hannah Clayton-Langton:

What does that mean in practice?

Hugh Williams:

That means you can't save anything to the machine. So suddenly the machine is full, has no further capacity. So now what? What do we do now that the computer is full?
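For the curious, here's a toy sketch of the chaos idea in Python: pick a random instance from a fleet, kill it, and check the service still responds. This is not Netflix's actual Chaos Monkey, which they later open-sourced and which terminates real cloud instances; the fleet and terminate method below are stand-ins.

```python
# Toy chaos sketch: randomly kill one replica and confirm the service survives.
import random

class Instance:
    def __init__(self, name: str):
        self.name = name
        self.running = True

    def terminate(self) -> None:
        self.running = False

def unleash_chaos(fleet: list[Instance]) -> Instance:
    victim = random.choice([i for i in fleet if i.running])
    victim.terminate()
    return victim

def service_is_healthy(fleet: list[Instance]) -> bool:
    # The resilience bar: the service stays up as long as any replica is running.
    return any(i.running for i in fleet)

fleet = [Instance(f"web-{n}") for n in range(5)]
victim = unleash_chaos(fleet)
print(f"killed {victim.name}; service healthy: {service_is_healthy(fleet)}")
```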

Hannah Clayton-Langton:

And have other companies followed suit, like have they now set the standard for this chaos engineering?

Hugh Williams:

Yeah, they actually open-sourced it, which means that they made all their chaos engineering tools publicly available, which is super cool. I mean, a great, great thing for a company to do. Good publicity for them, right.

Hannah Clayton-Langton:

Makes it easier for them to hire engineers. Well, their whole thing is like agile and they sort of lead the way right.

Hugh Williams:

So I guess it fits with their brand. Yeah, and they had all this stuff around how you could take unlimited leave.

Hannah Clayton-Langton:

Yeah, I've read the book No Rules Rules.

Hugh Williams:

Yeah, so they made it publicly available. It's open-sourced, and then you can actually go and use these chaos engineering tools now from Netflix, and so I think that's probably lifted the resilience of the whole of the internet now.

Hugh Williams:

Yeah, which is amazing, and which I guess everyone stands to benefit from, right? Yeah, absolutely. And I think, you know, if you go and talk to an individual engineer, they're motivated by that stuff, right? Like, they like to help other engineers. You know, as a sisterhood or brotherhood, I think, between engineers and engineering leaders, you know, we're all largely doing the same thing, and I think helping each other is something that most folks are pretty interested in.

Hannah Clayton-Langton:

And so what's fuzz testing? Is that a similar concept?

Hugh Williams:

Yeah. So fuzz testing would have helped the folks at CrowdStrike for sure. So fuzz testing is basically generate lots and lots of random data and see what happens. So maybe let's put this in the context of Microsoft Word or Microsoft Excel. So imagine that on your disk drive on your computer there's all sorts of Word documents appearing that are fictional, right, so they're not structured in the way that a Word document should be, so there's something broken about them.

Hugh Williams:

Maybe the table or the heading has a bug in it that might cause Word to crash. So instead of the file being properly formatted, it's got a formatting issue. And then when you try and open it, what happens? Does Word gracefully deal with that or does Word crash? And so if you generate enough of these sort of fictional fake files, you might find some issues with Word when it tries to open those files. So imagine we're now at CrowdStrike with this Falcon product. We would have been generating hundreds, thousands, tens of thousands of different files and causing Falcon to open those files, and of course we would have found the kind of issue that they actually found in production. So it's really just about generating random data and having that data being inputted into the systems that we're building. So, a very popular way these days of testing the kinds of issues that the CrowdStrike folks faced.
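Here's a minimal fuzz loop to show the shape of the idea: generate lots of random, often malformed, inputs and check the parser never fails in an unexpected way. Real fuzzers (AFL, libFuzzer, or property-based tools like Hypothesis) are far smarter about generating inputs; the parse_header function below is a made-up target for illustration.

```python
# Minimal fuzzing sketch: random inputs in, unexpected exceptions recorded as bugs.
import random

def parse_header(data: bytes) -> dict:
    text = data.decode("utf-8")                      # may raise UnicodeDecodeError
    name, _, version = text.partition("/")
    return {"name": name, "version": int(version)}   # may raise ValueError

def random_input(max_len: int = 32) -> bytes:
    length = random.randint(0, max_len)
    return bytes(random.randint(0, 255) for _ in range(length))

crashes = []
for _ in range(10_000):
    sample = random_input()
    try:
        parse_header(sample)
    except (UnicodeDecodeError, ValueError):
        pass  # rejecting bad input cleanly is the behaviour we want
    except Exception as error:
        crashes.append((sample, error))  # anything else is a bug worth filing

print(f"found {len(crashes)} unexpected crashes")
```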

Hannah Clayton-Langton:

So chaos engineering, or chaos monkeys, and fuzz testing are both just about, like, robust testing?

Hugh Williams:

Yeah, just chaos. Chaos. You know, you'll even see, in some companies... you know, I used to work on Google Maps at Google, where we had a room full of every possible piece of hardware that could run Google Maps, so every smartphone you could ever think of, like smart watches.

Hugh Williams:

Every Apple iPhone, watches, every possible Android device, of which there's, you know, it's the Wild West. We'd have Google Maps installed on all of those, and we'd have all of those carrying out certain actions in Google Maps, and then we'd be able to understand if there's any particular issues with Google Maps on any particular devices. So I think these large-scale companies really are, you know, they've got chaos, they've got random data, they've got, you know, environments where they're constantly testing all of the possible outcomes that customers could have, and that gives you a lot of sort of defence, if you like, against issues arising in practice.

Hannah Clayton-Langton:

Okay, I've got a few follow-up questions. So does better code have fewer bugs? Like, if you get a lot of bugs in something, is it a sign that you've done it too quickly or not robustly enough?

Hugh Williams:

I think that's generally fair. I think you know one or two things can be going on if you don't find many bugs. So one is the happy path, which is wow, we're building really robust software. You know, look at us, go, that's fantastic. The other thing that can be happening is we're not testing it well enough or we've got a culture where bugs are bad. Some companies have a culture where they say if we find too many bugs, we need to punish some people. Well-run companies don't do that. Well-run companies say finding bugs is awesome. That means that we're really stress testing things in a way that we should.

Hugh Williams:

Our test team's working really, really well and it's really about making sure that those get dealt with and dealt with in a really systematic way, so I'm not sure that bug count is necessarily a good measure of quality.

Hannah Clayton-Langton:

It's an interesting balance between your risk appetite and speed, or risk appetite and innovation.

Hugh Williams:

I'd say probably a better thing to track is, do you fix the bugs within some SLA, some service-level agreement? So you're going to have some agreement in your company about how fast bugs should be repaired, and that's probably going to depend on what we call the severity of the bug.

Hannah Clayton-Langton:

So at one end there's probably P3 or P4 bugs which we just frankly don't care about.

Hugh Williams:

P stands for priority, priority, so really low priority bugs where you might say, look, it'd be nice if we fix this at some point, but this is so minor that let's just not enforce an SLA. As you kind of move up the tree, you know a P2 bug, you might say, look, we have to fix this within a month. A P1 bug, we might say we have to fix it within a day or a week or whatever it is. And then a P0 bug would be we're not doing anything until this bug's fixed. So all tools down, nobody's doing anything until we actually get this thing rectified Because it's an outage right yeah or or.

Hugh Williams:

It's so significant that you know it's impeding our customers or our users in doing something significant.

Hannah Clayton-Langton:

You know, it's a legal issue or it's an embarrassment to the company or whatever it is, right. So it sounds to me like, if I'm thinking about my flat as a metaphor, a P3 is like I've scuffed the wall, we might never fix that. And then a P0 is like a pipe is actively flooding, and so we're not doing anything while the pipe's broken.

Hugh Williams:

I think that's great.
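Before moving on, here's a small sketch of the priority-to-SLA idea Hugh describes, as it might look in code. The exact windows vary from company to company; the values and function below are purely illustrative.

```python
# Illustrative priority-to-SLA mapping: P0 means fix immediately, P3 is best effort.
from datetime import datetime, timedelta

SLA_BY_PRIORITY = {
    "P0": timedelta(0),          # all tools down: any delay is a breach
    "P1": timedelta(days=1),
    "P2": timedelta(days=30),
    "P3": None,                  # nice to fix some day, no enforced SLA
}

def is_breaching_sla(priority: str, opened_at: datetime, now: datetime) -> bool:
    sla = SLA_BY_PRIORITY.get(priority)
    if sla is None:
        return False  # no SLA to breach
    return now - opened_at > sla

opened = datetime(2024, 7, 19, 9, 0)
print(is_breaching_sla("P1", opened, datetime(2024, 7, 21, 9, 0)))  # True: over a day old
print(is_breaching_sla("P3", opened, datetime(2025, 1, 1)))         # False: best effort
```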

Hannah Clayton-Langton:

Okay, so I'm really interested in when it all goes wrong, like for the CrowdStrike teams on that day in, like, July, August last year. Like, I know that we have on-call engineers. I've got a friend who's a surgeon and he has the on-call phone, and if someone needs an emergency surgery in the night that can't wait till the morning, he gets called in. And I'm guessing that's pretty similar for the on-call engineers, right? They're the ones that have to crisis manage.

Hugh Williams:

Yeah, that's right, and I think that health analogy is pretty good too. It's a nice analog, right. So I think if a patient comes into emergency, if this patient presents in a certain way, let's try steps one, two, three and four and let's record what happens, and then, if we're not in a happy state after that, then-.

Hannah Clayton-Langton:

If you can't stabilize, then you need to bring someone in.

Hugh Williams:

Yeah, then we go and find a specialist. We bring in the surgeon, you know, admit them to intensive care, whatever it is. So there's a series of activities that are going on, and those activities involve different groups of people, and so I think in tech companies, when something goes wrong, quite often the first line of defense is some type of operations team. They'll get alerted first, but they may find it themselves. They might see a dip in a graph or whatever it is, some behavior, and then they'll almost literally pull out the plastic card that says, oh, if this is occurring, then try the following steps. If that doesn't solve the issue, then they're going to wake somebody up, and usually they would wake somebody up who is related to the area where the issue is.

Hugh Williams:

So let's imagine we're running an e-commerce site and customers can't pay for things. So that operations team is going to know that they need to talk to the payments team, and they're going to know who's on call, and they're going to wake somebody up who's on call within the payments team. Now, that person is going to be a random person from the payments team, so they might not know all the subtle details of how the Mastercard payments work. They'll try a set of activities and, if that doesn't work, ultimately they're going to go find the person who's the expert and get them involved. And so, yep, there's this front line of defense, yeah, there's on-call folks, but ultimately, when things get really serious, the expert tends to get involved in the end.

Hannah Clayton-Langton:

Okay, and so if you are a software engineer, is it very typical that you'll have to opt into being on call every week or once a month, or is that something that is reserved for only people who volunteer?

Hugh Williams:

So it's not usually reserved for folks who volunteer. Usually it's something that, you know, has a rota that goes around and around and around. But I would say that this is probably only common now in sort of your mid-tier tech-enabled companies. And so if I was the vice president of engineering and I walked into a company in a new job and I found out that we had a pager rotation and that everybody's spending time on call, I would say our software is not resilient enough.

Hannah Clayton-Langton:

That means you're expecting something to go wrong. Yeah.

Hugh Williams:

So it's going to be something about how we've built the software. It's not defensive enough. It's also going to be something about, perhaps, our software engineering processes, where we're not building things to a level of quality that allows us all to sleep well at night. So if you go to the big tech giants, you know, Google, where I used to work, or Microsoft, you're not going to find that these days. So these days people sleep well in general.

Hannah Clayton-Langton:

In general. Well, because I once, many years ago, went out on a date with a guy who was a software engineer at a really well-known tech company that everyone listening, including you, will have heard of. This podcast is taking an interesting turn now. But anyway, he mentioned at the beginning of this date that he was on call, and I was like, well, you can't have a drink, like, you're on call?

Hannah Clayton-Langton:

And he laughed at me and he just was, like I can have a drink and still be on call. And I was thinking, well, if you were a doctor, you definitely couldn't have a drink and be on call.

Hannah Clayton-Langton:

Yeah, I'm not sure. He had quite a few drinks, and I can tell you that by the end of the night, if something had gone wrong... or maybe his shift had ended, Hugh, I'm not sure about it. This sounds like a mid-tier tech company that's not run super well. Okay, well, I'll tell you after we stop recording where he was working. But that always fascinated me, because to me that's quite a big responsibility, and I presume people get... do they get paid more to be on call, or is that...?

Hugh Williams:

Certainly here in the UK, yes. And I've been working with a few companies in the UK, as you know, and folks in some of those companies actually rely on those payments to, you know, make their mortgage and those kinds of things. So it's a real culture of these extra payments and on-call rotations here in the UK. But I'd say that's an unusual thing in, you know, the large tech companies in the US.

Hannah Clayton-Langton:

Here's another scenario for you, thinking back to CrowdStrike, or maybe we can draw on some of your experiences of outages, because they will happen, that's as sure as anything. Do people pile into a room? How does it work? It's all systems go. What are we doing?

Hugh Williams:

Yep. So there's almost always going to be what they call a bridge call of some sort. So there's going to be a call that you can dial into, and there will be a set of people on that call. There's probably a communications channel that goes with that, so something like a Slack or a Teams messaging group. So there's going to be a central place where all the key people are having conversations about what they're doing to rectify the situation. I've found in some of those situations that, as a leader, I've had to take charge of the call, or whatever it is, provide a little bit of structure for the call. But generally as a leader you don't know enough details to be able to actually resolve the situation yourself. But it's a great way to kind of listen to the team, understand where we are in the resolution of the issue, and use that information and report that information out and about to the rest of the organization.

Hannah Clayton-Langton:

Yeah, because there's a lot of skills that you need in that situation, right? Like, you need someone who's good at communication and structure and who has a sense of sort of momentum and urgency. But they might not be, it'd be great if they were, but they might not be, the same person who's really technically skilled to be able to assess, I don't know, the error messages or the patterns of behavior, and think about what it could be that's causing the issue. And there's this whole crisis element of, maybe if it's really serious, everyone's in a panic mode. So you need quite a collection of skills to be good at that, right?

Hugh Williams:

Yeah, absolutely, and I think, you know, a good operations team is good at putting structure around that. You know, when I used to work in tech in the US, often we'd employ people who were ex-US Marines. So the operations team, you know, used to always call me sir all the time.

Hannah Clayton-Langton:

Because they don't panic.

Hugh Williams:

They don't panic and they're good at putting structure around things. They weren't the coders, no. Typically, the operations team, these are the folks who are sitting looking at tons of giant screens on a wall 24 hours a day and looking for changes in patterns or changes in behaviors, and they're the folks who sort of will run these calls, wake up the right people, get the right people on the call and put structure around it. But US Marines are very popular in operations teams in large tech companies, for sure.

Hannah Clayton-Langton:

I once worked with, and I'm going to get this wrong, a guy who was an ex-US Army helicopter pilot. Shout out to Matt if you're listening. And we were working on quite a stressful deal and he was like, I'm not being shot at, so I'm good. And yeah, he talked me through sort of how they managed themselves in like a real crisis, you know, when they were out on the field in Afghanistan, you know, securing targets. So I can see that they'd be pretty good to have around in a sort of fake emergency, when some code's gone down.

Hugh Williams:

And a lot of the builders, you know, the software engineers who are actually building the software, just don't have that in their DNA, right? They're sort of creative types who are a little bit artistic, a little bit scientific, sort of, you know, trying things, building things, playing with data and whatever else, which is a wonderful thing. But they're not necessarily the people who can put structure around a crisis and provide updates. But you know, if you have a major outage... I worked at eBay for a number of years. I think six months into my tenure we had a nine-hour outage of the search engine at eBay. Nine hours.

Hugh Williams:

Disastrous, oh my God. It cost the company many, many millions of dollars.

Hannah Clayton-Langton:

So you were? What role at the time?

Hugh Williams:

So I was a vice president of engineering and I was in charge of search.

Hannah Clayton-Langton:

Oh dear, you had a bad day.

Hugh Williams:

Yeah, it was one of the worst things that could happen six months into a job.

Hannah Clayton-Langton:

Six months in, so you were sort of accountable, but you could sort of play the card.

Hugh Williams:

I got my bonus that year, so I think they didn't hold me completely accountable, but I was probably only two or three months off being completely accountable for it. You know, in that situation, what I found myself doing was listening into the bridge call.

Hannah Clayton-Langton:

So the bridge call is the sort of pile on call.

Hugh Williams:

That's it where all the key people have been woken up and they're all actively working on it. I think this thing happened on a Saturday and went through to the Sunday before it was fully rectified. It's a little bit like the CrowdStrike Falcon outage, actually, because the particular issue that happened on the machines in the data centre in eBay actually caused the machines to be so busy running the software that you couldn't interact with them.

Hannah Clayton-Langton:

Wow.

Hugh Williams:

So the CPU usage went to 100%, which means that the computer's not capable of doing anything except the thing it's doing. So you can't type and have the computer recognize the keystrokes. It's so busy doing the thing it's doing. So all these computers went to 100%, which we call pegged. We say the machine's pegged and so you couldn't interact with the machine.

Hannah Clayton-Langton:

And is that, sorry, that would be anyone that was on eBay, or that was the servers running eBay, like the computers running eBay? So it's the servers running eBay.

Hugh Williams:

So you can imagine that there's many hundreds of computers that are the search system at eBay and all of these computers became so busy that they weren't capable of doing anything except being stuck doing this erroneous thing that they were doing. So quite a difficult situation of how do you fix a computer that won't talk to you. But in that particular situation we had the bridge call. We had all the key people on the call. You know they were all going to stay up all night and get this thing sorted out.

Hugh Williams:

Around every hour or so I would either join the call or I'd get an update from the director who ran the search team for me and he'd tell me what happened in the last hour. And then I'd very literally call every executive at eBay. So I'd call the CEO, I'd call the CFO, I'd call the head of the commercial team, I'd call the PR team, the comms team. I'd give them like the hourly update, I'd tell them what it is we're going to do over the next hour and then I'd say look, I'll be in touch in an hour and of course all these people want to know what's going on. I mean, this is a catastrophic outcome.

Hannah Clayton-Langton:

And were they like effing and blinding at you, like sort this effing thing out. Or are they like you're our only hope to get it fixed? So we better be nice.

Hugh Williams:

Look, I mean, after, you know, after several of those updates, I'm showing that we understand what the issue is, I'm showing that we have a path to recovery, I'm showing that we've got all the right people working on this, and we're going to get there. We're going to get there in the end.

Hannah Clayton-Langton:

And it's like we understand that the computer's pegging. But, like you can imagine, if you can't figure out what's going wrong, there's that panic period until you figure out what it is, where you're just like overwhelmed with error messages.

Hugh Williams:

Yeah, and this particular situation is really hard because you know the computer you need to fix won't talk to you. So ultimately what you've got to do is either reboot that computer so it loses all of the things that it knows about and it's doing and try and get it back into a state where it's operable, or you've got to create another computer that does the same thing and then take the one that's busy offline and put the new one online. But it's actually a really difficult situation to sort out.

Hannah Clayton-Langton:

And once you've sorted it out, what kind of post-incident review happens? Like, could people lose their jobs over doing something basically irresponsible or careless with the way they ship code?

Hugh Williams:

Yeah, great question. I think poorly run companies will fire somebody when they are doing their best, taking risks, trying to really get things done, and they make a mistake for the first time. A poorly run company will fire somebody in that situation. They'll say, well, you know, you made a mistake, you're gone. Guess what happens then? Nobody wants to build software anymore. So everybody's now very, very cautious, goes very, very slowly, doesn't want to do interesting, risky things that could really change the game. They want to just keep their head down so they don't get fired. And so if you have a culture of firing people who make mistakes, you end up with a pretty slow-moving company. So I think that the art here is, first of all, you've got to have what we call blame-free postmortems. So the situation's over, great, we've got the service back up and running. Let's sit down and just have a really structured conversation about what went wrong and what are the things that we could do next time to ensure that these kinds of problems don't happen again.

Hannah Clayton-Langton:

And then everyone learns something, I guess right.

Hugh Williams:

We'll write it up. We'll write it up really well in proper prose, we'll share it around, we'll talk about it and we'll sort of celebrate the fact that we're a better company now, because we know we won't make that mistake again. Now, if the same person makes the same mistake after all that process, then I think we have to have a harder conversation. But we should just celebrate the fact that we're pushing the limits, we're being the best that we can be and we're learning and we're growing in a well-run company.

Hannah Clayton-Langton:

Even if you cause eBay to be down for nine hours, or even if you downed half the internet with the CrowdStrike issue, you think that's still sort of like a lessons learned, as before?

Hugh Williams:

Look, I think that the CrowdStrike Falcon issue I think there's a lot of really bad things that happened there that show that that team wasn't a well-run team.

Hannah Clayton-Langton:

And so I think that one probably has to go a little bit further. Well, and there's a whole sort of external lens here, right? If you're a public company, or you're a security company like CrowdStrike, and you down half the world, then you don't seem like that secure an option anymore, right? And I think their market cap and their share price massively dipped as a result. So you've got a lot to do to clean up after those sorts of incidents, right?

Hugh Williams:

Yeah, and I think that's, you know, an 11 on a scale of one to 10.

Hannah Clayton-Langton:

Yeah, well, the worst in history, maybe not the best example to use.

Hugh Williams:

Yeah, but I think, you know, there's a lot of really hard questions to ask. I mean, think of two or three questions that you could ask. You'd say, but did they test it? Like, did the engineer and the test engineer actually test this thing? Like, did they actually try deploying this file onto a machine or a couple of machines and start those machines up and see what happens? I don't understand how they didn't have some kind of test client, test environment, progressive rollout, you know. So if this was me, I would say, well, you know, we'll deploy it to ourselves first, right? So we're CrowdStrike, we're obviously running CrowdStrike Falcon on our machines. Let's deploy it to ourselves and see what happens. And then we probably would have taken down our company, but not every company, and then we could have gone, okay, you know, big mistake made, done the postmortem, been smarter and not taken down the whole of the internet.

Hannah Clayton-Langton:

Well, progressive rollout is something interesting that I don't think we touched on, which is essentially as it sounds, right? Like, you might start with yourselves and then you start with a small tranche of customers to make sure that things behave as expected. Is that the same as something I've heard called canary testing? Is it called canary testing?

Hugh Williams:

Yeah, yeah, it is. I mean, let's just talk it through. The engineering team has got its own, what we call, environment, and there's probably lots of these environments for the engineering team. So the engineering team can stand up a mini eBay and they can just test it themselves.

Hannah Clayton-Langton:

So it's like a simulated version of your external product, where you can test things in isolation. Yep.

Hugh Williams:

So you've got your own one, completely harmless. It's probably got some slightly fictional data in it. It's probably a scale replica; it's not quite as big as the whole system. Your testing team's probably got its own environment the engineers are never allowed to touch. So they say, okay, when the engineers are done with their sort of fiddling around, they give it to us and we actually test it. And they'll test for the functionality. They'll also test things like load. So they'll say, can the system handle the load that we'd normally expect?

Hannah Clayton-Langton:

The traffic load you mean there, right, the number of users? Yep, yeah.

Hugh Williams:

So they'll do all sorts of testing, and then there's probably what's called a pre-production environment, which is an environment where you put the next version that's going to go out to customers.

Hugh Williams:

Like a beta, almost? Yeah, exactly, exactly. Sort of pre-release, so pre-beta. And then eventually it'll actually go out to production. And when it goes out to production, coming back to the canary idea, we're going to slowly roll it out. So the first thing we'll do is we'll make it available if you know some trick, right? So we'll get it running, but you can only use this version if you maybe put some extra characters in the URL in your browser. And then we'll say, okay, let's turn it on for 1% of customers. So we'll select a random 1% of customers and they'll get a consistent experience that is this new experience. If that goes okay, then we might ramp up to 2%, 5%, 20%, 50%.

Hugh Williams:

So if your 1% of users crash, then you don't roll out to the other 99%? Yep, you say, well, something's gone wrong here. And that's pretty unusual, right? Because we've now gone from an engineering environment to a test environment, to a pre-prod environment, out to production. If something odd happens there, it's probably related to real user interaction, or real scale with real users, or real user data, or something that's difficult to simulate, like payments, for example. It's probably going to be something odd that's only going to happen in production. But again, if you're very careful with this, and you've got this sort of clear path for releasing software, then generally the problems that you have aren't catastrophic.
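Here's a sketch of how that percentage ramp is often implemented: each user is hashed into a stable bucket from 0 to 99, and the new experience is shown only to users whose bucket falls under the current rollout percentage. Hashing keeps the experience consistent for a given user as the ramp goes from 1% towards 100%. Real systems usually sit behind a feature-flag service; the feature name and scheme below are assumptions for illustration.

```python
# Canary-style rollout sketch: stable per-user bucketing by hash.
import hashlib

def bucket_for(user_id: str, feature: str) -> int:
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100  # a stable bucket in the range 0-99

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    return bucket_for(user_id, feature) < percent

rollout_percent = 1  # ramp to 2, 5, 20, 50, 100 as confidence grows
for user in ["alice", "bob", "carol"]:
    print(user, in_rollout(user, "new-search", rollout_percent))
```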

Hannah Clayton-Langton:

And so if you're working in an agile development environment, you're shipping code like potentially every day, so that must make it quite chaotic to test simultaneously and roll out progressively many different things at once. Is that how you end up with problems, where you just change too many factors at once?

Hugh Williams:

Yeah, I think that's it. And look, you know, I think a test team would, in an ideal scenario, like everybody to deploy, get their next release ready, and then that's in a test environment, and then that test environment's stable for a good amount of time so they can do all their best work and then that, you know, sanely moves to pre-production and out to production. But the reality today is that, you know, really well-run technology companies that are at scale are probably releasing tens, hundreds or thousands of different pieces of software a day.

Hugh Williams:

So the test team doesn't quite have the luxury of a stable version of Amazon or eBay or Google or whatever it is, because this thing's just moving the whole time.

Hannah Clayton-Langton:

But I guess that's another one of those like balancing acts. Right, because you don't want to be slow, you want to be quick and reactive and agile. I guess that's where the name comes from. It's probably worth the trade that, like in a minority of instances, you've changed too many things at once and then you react rather than trying to plan for perfection and then basically slow yourselves down, right.

Hugh Williams:

Yeah, that's it, absolutely. The argument for these continuous releases is that you can move really quickly. Look, and I guess the argument also would be, if you only change a small thing, the chances of it being catastrophically large or difficult to rectify are low.

Hannah Clayton-Langton:

And I presume you'd rather move forward, unless you're in like a panic state and the easiest thing is going to be to just click undo. Yeah.

Hugh Williams:

And I think, you know, modern software engineers today would say, I'd rather fix forward. I've got V1 out there and then I make a change, it's now V2, and then V2 has something wrong with it. I'd rather quickly go to V3 than go all the way back to V1. But if the failure is catastrophic, so the whole site's down, I'm like, well, I don't know what to fix in order to go forward to V3. So I'm going back to V1 as fast as I can, if I possibly can.

Hannah Clayton-Langton:

Yeah, well, yeah, and I think we said earlier that CrowdStrike couldn't go back, so they had to figure out something forward. Okay, and are there different ways that different companies approach outages or bugs? Like, it sounds like if you're super cautious, and maybe in certain industries that's the right thing, I'm thinking like banking or security, you want to be a bit more cautious. But are there different ways that you can anticipate these things happening that sort of match the industry?

Hugh Williams:

Yeah, I think you're spot on. I mean, I think you have to assess what is your regulatory environment, your legal environment, your risk appetite, your governance environment, the state of your customers, what your customers would expect. I think you have to assess all of those things. It's a bit qualitative, a little bit subjective, and then you can kind of set the dial in the right place as to how fast you want to move versus how safe you want to be, and those are enemies of each other, right? So if you and I work in a tech company and we employ a chief security officer, the ideal situation for the chief security officer is we never change anything ever. System's completely secure. We don't want all these engineers releasing software. I mean, that could cause all sorts of trouble. But that obviously is not what we actually want. And so I think, you know, security folks, sometimes legal folks, if I was going to pick on them, really want us to do nothing.

Hugh Williams:

Engineers, meanwhile, want to, you know, let loose: I want to build stuff, get it out as fast as I can, I hate bureaucracy, all this testing stuff's overrated, let's go, go, go. And so I think, as a leader, you've really got to set the dial to the right point, and that's not just an engineering and product conversation. It's a who-are-our-customers, what's-our-regulatory-environment conversation. Banks are a great example. I mean, if you continually break things and take risks and don't follow all the regulatory guidelines, the banking authorities will actually put people in your building who watch what you are doing. So I know one of the sort of neo-banks in the UK currently has, you know, government people in their building looking at everything that they're building and making sure that they start to comply, because they've just moved a little bit too free and fast and loose in an industry that, you know, I guess...

Hannah Clayton-Langton:

You've got a lot to lose if the banks go down.

Hugh Williams:

Got a lot to lose and you know, I guess it's a very, very highly regulated industry with a lot of controls, right, because there's things like you know, money laundering, fraud, crime, all these kinds of things that need careful governance and control. And of course, you know, obviously people don't want to lose their money and you know the government doesn't like it a whole lot if you're too free and loose in the banking environment.

Hannah Clayton-Langton:

Okay, and let's talk a little bit more about when the problems get through. Like you had the eBay screw up, were there any other like big outages that happened on your watch anywhere?

Hugh Williams:

Yeah, Look, I had a pretty interesting issue when I was working on Google Maps. It was pretty problematic actually. It made it into the Guardian and a whole bunch of other newspapers. Yeah, so what happened? Let me maybe just give you a little bit of setup first before I tell you the particular issue.

Hugh Williams:

So the setup is there's a team in my organisation, and their job was to pick, from all the possible labels of all the possible points of interest, which labels to show on the map at any particular zoom level.

Hannah Clayton-Langton:

Because sometimes if I open a map it will show you, like, certain cafes or certain restaurants or certain landmarks, but not all of them, right?

Hugh Williams:

Not all of them right. And if you zoom out enough, you start seeing the labels of states or counties. You zoom out even further, you see the labels of countries, right, maybe continents, these kinds of things. And the more you zoom in then you'll start to see post office boxes and labels for shops and all these kinds of things.

Hugh Williams:

And of course, you know, if you're trying to zoom in a long way, you've got to really think about which labels to actually show. And so some context about you as a human (where have you been before, what are your interests?) would help. Some context around, sort of, what other people are interested in would help too. All these things would be very helpful in choosing which labels are most likely to be useful to you as a human, right?

Hugh Williams:

So there's, a team of engineers that work on this. It's actually a really hard problem. If you're ever on a plane and you're watching the map on the plane, it does a terrible job of it. It will serve these random cities and random trenches in the ocean.

Hugh Williams:

Not super useful, it's not really great geography education, and that's because their label selection is pretty poor. Okay, right, so it's a hard problem. We had a bug. The bug was a pretty simple one, and the issue was that at certain zoom levels we were selecting the wrong labels. Simple as that, right? So, simple bug: we were supposed to be showing a certain set of labels, we weren't showing the right set of labels, and the particular manifestation of this bug was that the labels West Bank and Gaza were removed from the map of the Middle East.
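
To show how a bug like that can hide in plain sight, here's a deliberately simplified, hypothetical sketch of zoom-dependent label selection; the labels, scores and thresholds are all made up, and this is not how Google Maps actually works. A single wrong threshold for one zoom band silently drops the region labels while everything else keeps working.

```python
# Hypothetical sketch of zoom-dependent label selection, loosely in the spirit of
# the bug described above. All data and numbers are invented for illustration.

LABELS = [
    # (name, kind, importance score between 0 and 1)
    ("France", "country", 0.95),
    ("West Bank", "region", 0.80),
    ("Gaza", "region", 0.80),
    ("Corner Cafe", "poi", 0.20),
]

# Minimum importance a label needs to be shown at each zoom band.
# Zoomed out means only very important labels survive.
THRESHOLDS = {
    "zoomed_out": 0.90,   # continents and countries
    "mid_zoom": 0.60,     # regions, states, big cities
    "zoomed_in": 0.10,    # cafes, shops, post boxes
}

def select_labels(zoom_band: str) -> list[str]:
    cutoff = THRESHOLDS[zoom_band]
    return [name for name, _, score in LABELS if score >= cutoff]

if __name__ == "__main__":
    print(select_labels("mid_zoom"))   # ['France', 'West Bank', 'Gaza']
    # A one-line mistake, e.g. using the zoomed-out cutoff at mid zoom,
    # silently drops the region labels while everything else still "works".
    THRESHOLDS["mid_zoom"] = 0.90
    print(select_labels("mid_zoom"))   # ['France']  (West Bank and Gaza disappear)
```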

Hannah Clayton-Langton:

Okay, so that's like a political statement without meaning to be a political statement. And so the Palestinian Authority noticed this?

Hugh Williams:

They actually jumped to a bigger conclusion, which was Google has removed the label Palestine from the map. We'd actually never labeled Palestine.

Hannah Clayton-Langton:

Yeah, you know I can't comment on those kinds of issues, but I know the US doesn't recognize Palestine.

Hugh Williams:

And there's a whole bunch of issues there that are way above my pay grade.

Hannah Clayton-Langton:

Yeah, yeah, yeah, and you'd actually never labeled Palestine.

Hugh Williams:

What we had done was accidentally with a bug, remove the labels West Bank and Gaza, and so this caused a major international incident. We fixed the bug.

Hannah Clayton-Langton:

The labels came back, and I presume that was happening in loads of locations, not just in that bit of the Middle East, but it just so happened that where it's sensitive it's sort of surfaced much more quickly than if it was just me.

Hugh Williams:

Exactly, yeah, exactly, and quite rightly. There's lots of folks who are very, very sensitive about what labels appear, and at Google, certainly when I was there, I was having lots of conversations with governments and authorities around the world about what labels were present. We often got lots of requests to take things down, to fuzz out certain images of certain installations or whatever it is. So there's a lot of sensitivity around what exactly Google is showing and not showing. And because Google Maps has over a billion monthly active users, there's a huge user base out there looking at the product and reacting to anything that the product does. So, yeah, that happened on my watch. It was an honest, simple mistake in our label selector and we fixed the bug. We rolled forward, we moved on.

Hannah Clayton-Langton:

Yeah, wow. It's crazy to think about the impact of some of this stuff, particularly when you think it's a free service on your phone, and, like, all of these processes and teams and, you know, communications with governments have to be put in place for something that I would be offended if I had to pay Google actual pounds to download on my phone. I use it every single day.

Hugh Williams:

Yeah.

Hannah Clayton-Langton:

Okay, so basically, screw ups happen. Anything else you think we need to cover when it comes to outages and bugs?

Hugh Williams:

I think we've done a pretty good job. The one thing I would say is, if I was giving some advice to the listeners about how to think about it within your companies, I'd say, look, engineering is about making the trains run on time. Right, it's a lot about process, it's a lot about structure, it's a lot about rigor, and that sounds really boring, but if you get that right, then that'll free the company up to really move fast and build great software. And so I think taking these kinds of topics really seriously, treating them with the importance that they have, making it sort of a fabric of how the company works, will mean that ultimately you can do a lot more as an organisation. So I'd say always invest in this stuff.

Hannah Clayton-Langton:

Well, that has been the Tech Overflow podcast. I'm Hannah.

Hugh Williams:

And I'm Hugh. If you'd like to learn more about our show, you can always visit us at techoverflowpodcast.com.

Hannah Clayton-Langton:

We're on LinkedIn, Instagram and X as well, so Tech Overflow Podcast.

Hugh Williams:

Yeah, and we'll link into the episode show notes a whole bunch of resources that you'll find useful, as always.

Hannah Clayton-Langton:

As always, okay, great. Well, looking forward to recording with you again, probably virtually, next time.

Hugh Williams:

Yeah, that'd be a shame. Being in person has been so, so awesome.

Hannah Clayton-Langton:

Yeah, it's been awesome. I need to get to Australia more.

Hugh Williams:

Yeah, you should. You should just move there and we can just take this podcast seriously.

Hannah Clayton-Langton:

Well, I'll pick that up with my husband. Okay, thanks so much, hugh. I'll talk to you soon.

Hugh Williams:

Thanks, hannah, bye Take care.
