In episode 1 of Artificial Chaos, Holly and Morgan introduce the concept and benefits of chaos engineering, and how to get started running your own experiments. Holly does some impressively dumb things with code, and Morgan adds some whimsy with weird analogies about boats.
Absolutely do not go into your place of work and destroy a bunch of servers and tell them that Holly and Morgan told you to do that. We accept no responsibility for any service outages that are caused as a result of you listening to this podcast.
Next time you accidentally break a server, just tell your boss, it's Chaos Engineering! So how would you describe Chaos Engineering? If you had to give a quick definition?
Chaos Engineering is this concept that we experiment on production systems in order to build confidence in how those systems will perform under duress.
Does it have to be production?
Not strictly speaking, but, um, unless you have pretty much an identical copy of your production system, it's never going to respond in exactly the same way as a production system would. So it doesn't really give you the same confidence.
Yeah. So if your test environment doesn't match your production environment, then testing in test is not going to be a very good test. I know that it sounds dumb, but there are so many organizations that are in that situation. Um, you say experiments, so what are we talking about here? Earlier when we were talking, I made a distinction between tests, meaning for me unit tests and integration tests, and then you're saying chaos experiments. So why are we using a different word here? What does experiments mean?
Um, I think experiments are more open-ended. Rather than considering it as a specific test, the parameters won't always be the same, and you also don't know exactly how a system is going to respond, so you can't really hypothesize well in advance for these. There's also this concept of blast radius: limiting the potential impact that your experiment will have, and starting off with something small.
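As a rough illustration of limiting blast radius, here's a small Python sketch. Every name in it is illustrative, it isn't taken from any real chaos tooling: the point is just that an experiment targets a small, capped subset of the fleet rather than everything at once.

```python
import math
import random

def choose_targets(instances, max_fraction=0.1, seed=None):
    """Pick a small subset of instances to disrupt, capping the
    blast radius at a fraction of the whole fleet (minimum one)."""
    if not instances:
        return []
    count = max(1, math.floor(len(instances) * max_fraction))
    rng = random.Random(seed)  # seedable so an experiment is repeatable
    return rng.sample(instances, count)

# Target at most 10% of a 30-instance fleet: 3 instances.
fleet = [f"i-{n:04d}" for n in range(30)]
targets = choose_targets(fleet, max_fraction=0.1, seed=42)
print(len(targets))  # → 3
```

Starting with one instance and only raising `max_fraction` as confidence grows is the "start off with something small" part.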
So I wrote a blog post a little while ago where I was just kind of introducing this idea of Chaos Engineering to people who hadn't talked about it before, who maybe hadn't seen that term before. And in that context, I was being a little bit facetious, but one of the things that I said is, hey, I turn systems off to see what happens. And for me, that is pretty much Chaos Engineering. Not that it's limited to just turning things off, but that idea of, let's disrupt the system and then see how it reacts. So when I was doing those experiments, one of the things that I was looking at was really just answering the question of: we have a system where we know that when an instance goes down, a new instance will be spun up, and the system will handle that disruption.
But what we didn't know was how long that would take. So one of the ways that we tested that was, hey, let's turn a web server off and just time it: how long does that instance take to come back? And then we were also looking at things like, hey, when we turn an instance off, how visible is that to the user? I mean, ideally not at all, right? But you turn so many things off, eventually something's going to be visible to the user, or maybe you've built the system in such a way that when you cause some disruption, something unexpected happens. But that was really how I started with Chaos Engineering: we've built a system, we think it's resilient, let's turn some stuff off and see not only how it breaks, but how long it takes to come back. Is that your experience with Chaos Engineering? Or are you thinking about this in a different way?
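That "turn it off and time it" experiment can be sketched in a few lines of Python. The health probe and every name here are illustrative assumptions, not part of any real toolchain: after injecting the fault, you poll a health check and record how long recovery takes.

```python
import time

def time_to_recover(is_healthy, timeout=600.0, interval=5.0):
    """Poll a health check after injecting a fault and return how many
    seconds the system took to report healthy again (None on timeout).
    `is_healthy` is any zero-argument callable, e.g. an HTTP probe."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if is_healthy():
            return time.monotonic() - start
        time.sleep(interval)
    return None

# Simulated probe for illustration: reports healthy on the third poll.
polls = {"n": 0}
def fake_probe():
    polls["n"] += 1
    return polls["n"] >= 3

elapsed = time_to_recover(fake_probe, timeout=10.0, interval=0.01)
print(elapsed is not None)  # → True
```

In a real experiment `is_healthy` would hit the load balancer or the instance's health endpoint, and the recorded times become your baseline for "how long should this take?".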
Um, so I think you really need to define what you mean by resilience before you can begin to experiment in order to achieve resilience. And there are a bunch of different interpretations of resilience, and different metrics that you can use to define it. So do you mean absorption, or recoverability, or, you know, mitigation of the incident or event in the first place?
So what do we mean by resilience, then, is going to be my next question. Because for me, resilience, I would try and define that as something like a system's ability to continue functioning even if components fail. For me, resilience means something very different to availability, but in my head they're somewhat linked. And whenever anybody says availability, I think cybersecurity, right? It's one of the pillars of cybersecurity: confidentiality, integrity, and availability. So I would think of denial of service attacks, but that is still linked to resilience, right? If the system is very resilient, then having an availability impact would be difficult. So for me, resilience is how well can it handle some component-level outages? But you've used some different terms though. So absorption could be how well a system can handle a high traffic load. That could be one thing.
What about elasticity?
Oh, elasticity's an easy one. So elasticity is a big thing for me, because our systems scale up based on user load, right? But elasticity is not just scaling up; it's the ability to scale down when that user load reduces as well. So one of the things that we're trying to do with our systems is, effectively, when the system is quiet, run them as small as possible. And the reason for that is, of course, with public cloud systems we're running on OPEX, right? So we're effectively paying per usage. With public cloud providers there are lots of different ways of managing that, you can use reserved instances and things, but keeping things simple: if users aren't using the system, we want to scale things down so that the cost is reduced. And elasticity is part of that. For me, elasticity is really when you're talking about scaling down, as opposed to scaling up.
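To make the scale-up-and-down idea concrete, here's a toy target-tracking calculation in Python. The function and parameter names are ours, not any cloud provider's API; real auto scaling policies track metrics like average CPU, but the clamped arithmetic has the same shape.

```python
import math

def desired_capacity(requests_per_sec, per_instance_rps,
                     min_size=1, max_size=10):
    """How many instances do we want for the current load?
    Clamped so the fleet scales down when quiet (elasticity)
    but never drops below a floor or exceeds a ceiling."""
    if requests_per_sec <= 0:
        needed = 0
    else:
        needed = math.ceil(requests_per_sec / per_instance_rps)
    return min(max_size, max(min_size, needed))

print(desired_capacity(950, per_instance_rps=100))  # busy → 10
print(desired_capacity(40, per_instance_rps=100))   # quiet → 1
```

The quiet-hours case is the cost-saving half of elasticity discussed above: load drops, the computed capacity drops with it, and you stop paying for idle instances.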
A handy part of auto scaling in AWS is that you can scale based on instance number or size, which is really handy. Interestingly, the way that I learned about auto scaling and how handy that can be for business: I was actually studying for my Solutions Architect Associate exam at the time, and there was a medical supply conference in America. I think it was a couple of days before the conference that the website was hit by a DDoS attack. AWS basically said, we can scale up your provisioned instances, and we'll just absorb this as much as we possibly can and see what our systems can handle. And they scaled up, and they didn't scale back down, and just sort of absorbed the attack until it eased off, and then they kind of split the bill at the end of that. But I think that sort of covers scalability and absorption, which would be two metrics of resilience, or what I would personally define as resilience.
I think most people who deploy auto scaling groups, so this idea that when the user load increases, more instances are deployed, people think about that in the context of elasticity: hey, if we suddenly have a bunch of users come in, we'll scale up to handle that, and then when those users leave, we'll scale down again. That isn't, for me, the major benefit, just because it's such a core part of it now that I almost don't even think about it; that's just part of what it is. What I like about auto scaling groups is that they can monitor the health of instances. So if an instance has a failure for some reason, even if that failure is something really dumb, like, oh, that instance's logs filled the disk, and it failed because its disk is full now.
The auto scaling can detect that that instance has an outage, and then can tear it down and spin up another one, and can handle that. So for me, one of the things is also not just the ability to handle user load, but to handle unexpected outages in that way. Now, some people might be listening to that and thinking that's crazy, because your systems should never get into that position, but that's the whole point of what we're talking about today, right? Sometimes unexpected things happen; we should build systems in such a way that they can handle unexpected things. Um, especially from the context of the user. I never want the user to know that something failed, if I can possibly help it. Can we, you know, aggressively handle it by some other means?
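The replace-unhealthy-instances behaviour described here can be sketched as a single reconciliation pass. This is an illustrative toy, not AWS's actual algorithm; real auto scaling groups run this loop continuously against EC2 and load balancer health checks.

```python
def reconcile(instances, desired_capacity):
    """One pass of a toy auto-scaling reconciler. `instances` maps
    instance id -> 'healthy' | 'unhealthy'. Returns the ids to
    terminate and how many replacements to launch so the group
    gets back to its desired capacity."""
    terminate = [i for i, state in instances.items()
                 if state == "unhealthy"]
    healthy = len(instances) - len(terminate)
    launch_count = max(0, desired_capacity - healthy)
    return terminate, launch_count

# A full disk kills i-b's health check; the reconciler replaces it.
fleet = {"i-a": "healthy", "i-b": "unhealthy", "i-c": "healthy"}
print(reconcile(fleet, desired_capacity=3))  # → (['i-b'], 1)
```

The point of the sketch is that the recovery is mechanical: no human decides anything, the unhealthy instance is simply torn down and replaced.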
So, yeah, it's almost funny now that auto scaling groups, to me, have almost got little to do with scaling anymore and have just become more to do with resilience. Um, so Chaos Engineering, I think we've defined pretty well now: we're experimenting on our systems, we're going to introduce faults to see how our systems handle those. I think a lot of people who are listening to this maybe still have an old-school way of thinking about this, and when they think about outages, they might be thinking on the disaster recovery, business continuity type scale. They might be thinking, hey, you have an outage, and then you go and get your incident response plan out and you start going through it. But what we're talking about is trying to build systems in a way that they're resilient, by which I mean that the outage is just handled automatically, right? We don't need human intervention here. We want to build a system in such a way that it's just handled.
I would agree, but there are other benefits to Chaos Engineering beyond just building self-healing infrastructure. You can architect systems to ensure graceful degradation of services. For example, LinkedIn's Project LinkedOut, which is their application infrastructure Chaos Engineering project, actually introduces this concept of graceful degradation. It identifies the core workflows and processes on a particular page that will be accessed by a user, and if there is some component failure or an error on the backend, it will prioritize those core workflows and processes over, say, running ads or loading third-party content. But again, you would need to define what your core workflows were before you could really put that into practice.
I think that, "you need to define these things", is such a big part. I was working with a company recently where I was talking through their, really, their disaster recovery process: hey, what would happen if you had an outage, and how would you recover from that? And one of the things that was very interesting to me with that company is they hadn't even defined which systems were critical. They had no priority for systems, and they were absolutely adamant that they could recover from an outage because they kept pointing at backups. They're like, hey, if this server disappears, here are the backups for that server. And what I was trying to point out was, you might not have an outage of that level. It might not be, hey, that one server explodes. It could be, hey, the entire network is down.
How are you going to recover now? And I'm sure that they would have been able to recover, because they have the backups; the technical process of doing that would have been okay. But what I was worried about was that priority that you mentioned: hey, you want to bring up the important systems first, and they didn't have that. So I guess, can we look a little bit at some of the definitions we have around disaster recovery first, and then maybe build up from that? I've talked about system criticality there, but there are the metrics as well, right? So there are things like RTO and RPO. Should we bring those in?
Yeah. So RTO, recovery time objective, and RPO, recovery point objective. Broadly speaking, they're defined as the maximum tolerable downtime that your systems can deal with, and the point in time that you can wind back to, the amount of data loss you can withstand as a business.
Yeah, that's exactly how I think of it. RTO is how quickly you want the systems to become available again, but RPO is how much data loss you're willing to accept. So I almost think of it in terms of: RTO would be counting forward from now, how long are you happy for these systems to be down for? Whereas RPO would be how far back in time you want to go. And I think a lot of companies have maybe not thought about these things. They should be the absolute maximum that the company can accept, and then there should also be, in my opinion, a desirable value, right? So one of the things that I've been playing around with recently in terms of experiments is not only can the system recover from an instance failure, for example, but how long does that take?
One of the reasons there just being knowing exactly how much resilience we have in that system: if an instance goes down, how long should that take? That's useful for knowing if something else has gone wrong. You know, hey, this instance usually rebuilds itself in like one minute forty-five or something; it's been three minutes, maybe there's something else going wrong. But I think from a business point of view as well, it makes you really think of that criticality side of things. If you're saying, hey, all of these systems are going to go down, maybe you have a different RTO depending on how important that system is to the company. When I was talking to that company I mentioned a second ago, one of the things that they were prioritizing on was whether a system was publicly facing or not. So the thing that they wanted to bring up as quickly as possible was their email systems, so that, hey, if somebody emails us, we don't want to miss that. Whereas internal systems that were only used by their staff, even if they're important, they're like, oh, we're going to have a lower criticality for that, because nobody external will notice. Their staff will feel the pain, as opposed to external people like customers feeling the pain.
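As a concrete worked example of the RTO and RPO actually suffered in an incident (all times below are invented for illustration):

```python
from datetime import datetime, timedelta

def actual_rpo_rto(last_backup, failure_at, restored_at):
    """Given the last good backup time, the failure time, and the time
    service was restored, return the data loss window (the RPO actually
    suffered) and the downtime (the RTO actually suffered)."""
    data_loss = failure_at - last_backup   # how far back in time you went
    downtime = restored_at - failure_at    # how long you were down
    return data_loss, downtime

failure = datetime(2021, 6, 1, 14, 0)
loss, down = actual_rpo_rto(
    last_backup=datetime(2021, 6, 1, 2, 0),    # nightly 2am backup
    failure_at=failure,
    restored_at=failure + timedelta(minutes=105),
)
print(loss, down)  # → 12:00:00 1:45:00
```

Comparing those measured values against the objectives the business has set is exactly the "absolute maximum versus desirable value" distinction: a 2am-only backup means an afternoon failure costs twelve hours of data, regardless of how fast you restore.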
Yeah, absolutely. I think it takes quite a lot of maturity in terms of risk management for an organization to have defined, firstly, what assets they've got, because there are so many organisations that don't even know what's on their estate, and then to have defined the critical services and assets and processes that are running. Whether you decide to prioritize things that are client or customer facing, or internal finance processes or payroll systems and things like that, is going to be individual to each organization. But then experimenting beyond your business continuity and disaster recovery strategies is sort of a level above, really, when it comes to maturity of risk management and operational resilience management as well.
It's so funny to hear you say that, because it's absolutely true, and I know it's true, but I almost don't think about it anymore from our systems. You say there are many companies out there that don't know what assets they've got; they don't have any kind of asset management system. We could pull this up into other topics as well, story for another time. But one of the things that I always like to see from customers is, can you tie your asset register to your vulnerability management platform? So not only do you know what assets you have deployed, but can you tie vulnerabilities to those assets, and then also, can you tie dependencies to those assets? So it's like, hey, this service is really important to us because it stores all of our data, for example. A lot of companies don't think about: yeah, but if this other system is down, if the authentication system is down, for example, we can't log into that system.
So yes, the asset is important, but also its dependency is important. And I think that's really where we start getting back into Chaos Engineering: building, as you said, that maturity, where not only do we know what assets we have, we know what's supposed to be communicating with what, and what their dependencies are, and now we're going to test that in a really interesting way. So I guess we should talk a little bit about testing, because when you use that term test, instead of experiment as we were using earlier, test means something really different to different people, right? I do a lot of software development, so when somebody says test to me, I immediately think unit tests and integration tests. Is this piece of code in isolation working the way that I expect it to? Is this piece of code in situ working the way that I expect it to? But that is very much something that I have defined rigorously. So for the unit test, it might be something like: when I type a username, can the system accurately tell me whether that username is taken or not? Whereas what we're talking about here with Chaos Engineering is more experiments. It's the unexpected side of things.
Yeah. So, if I switch this server off, what's going to happen? You don't have a predefined outcome, a binary result that you can rely on to prove that your supposition was correct. Um, yeah, it's much more open-ended, which is why I suppose chaos experiments is more accurate and more relevant than chaos testing.
Yes. Because you should still have a hypothesis of what's going to happen, and you're testing that hypothesis, but I guess we're going into it knowing that what we're doing could break something unexpected, and we should be prepared for that. Right?
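That hypothesis-first shape of an experiment can be sketched as a tiny harness. This is a generic illustration, not any specific chaos tool's API: verify the steady-state hypothesis, inject the fault, check the hypothesis still holds, and always roll back.

```python
def run_experiment(steady_state, inject_fault, rollback):
    """Skeleton of a chaos experiment. `steady_state` is the hypothesis
    check (e.g. 'users can log in'), `inject_fault` disrupts the system,
    `rollback` undoes the disruption no matter what happened.
    Returns 'passed', 'failed', or 'aborted'."""
    if not steady_state():
        return "aborted"          # system already unhealthy: don't experiment
    try:
        inject_fault()
        return "passed" if steady_state() else "failed"
    finally:
        rollback()                # limit the blast radius no matter what

state = {"up": True, "rolled_back": False}
result = run_experiment(
    steady_state=lambda: state["up"],
    inject_fault=lambda: None,    # a no-op fault, just for illustration
    rollback=lambda: state.update(rolled_back=True),
)
print(result, state["rolled_back"])  # → passed True
```

The `aborted` branch and the unconditional rollback are the "be prepared for the unexpected" part: if the system is already degraded, or the fault does break something, the experiment stops and cleans up rather than compounding the damage.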
Yeah, absolutely. Um, but then also you've got the, I guess, consideration that with cloud infrastructure, part of the benefit of the scalability and how quickly you can deploy new resources is that you can just spin things up without maybe much knowledge or understanding of that cloud provider's catalog of services. So AWS have this framework, the Well-Architected Framework, which will, broadly speaking, teach you how to deploy an application that is, uh, like, resilient. So, um...
Yeah. I was just listening to you try and say that, and it's one of those things where you don't want to use one of the words that they've used as a pillar. You don't want to say the Well-Architected Framework allows you to build a system that will perform well, because performance is one of those pillars. It helps you build a system that is well architected, is one way of putting it. Another way is, it helps you build a system that is good. Broadly good.
So the Well-Architected Framework has five pillars that help you, I guess from my point of view, just make sure you've covered everything, right? So historically we might have talked about people building systems that are functional but maybe not secure; that's one weakness the Well-Architected Framework tries to help you mitigate. But there's also building a system that's maybe cost-ineffective; the Well-Architected Framework tries to help you mitigate that too. So the way that I think of the Well-Architected Framework is almost like safety rails: you're building a thing, and it's going to set out a way that you can build it and consider the major aspects.
Yeah, no, I would agree with that. And I think as long as you've followed that, you do have the base capability there to be able to perform some of these experiments on your infrastructure; it doesn't really take a great deal. So I said before that it takes maturity in your risk management approach, but I think for a lot of startup companies and new tech companies that are using cloud infrastructure, that maybe don't have 30 or 40 years of in-house risk management experience, you can still get involved with this, and it's still something that can benefit your organization.
I was just cheating there and quietly searching for: how does AWS actually describe the AWS Well-Architected Framework? And by the look of things, they cheat as well and just point at the pillars anyway: "AWS Well-Architected helps cloud architects build secure, high-performing, resilient, and efficient infrastructure." It's pretty much just listing the pillars there, aren't you? Missing, of course, operational excellence, which I always think sounds great. It helps you build excellent things.
So how do tiny companies, um, tech native, cloud native companies do Chaos Engineering Holly?
Wow, so many things to break down from that one sentence. Two things that I want to talk about: tiny companies, let's dig into that in a second, but also you used the term cloud native. And this is something that I'm very, very passionate about, this concept of cloud native, because I think a lot of organizations at the moment are moving to the cloud, but they're doing that in, I don't want to be rude, but maybe a naive way. For example, maybe they had systems that were on-prem, where over time they had gone from having physical servers to virtual servers, and then they had lifted and shifted those virtual servers into the cloud. So now they're just running, effectively, VMs on somebody else's hardware: you know, Virtual Machines for Azure, or EC2 for AWS. I don't really, in my head, and this is going to sound a little bit surprising to some people, I don't really consider those cloud systems. People might point at them and go, but they're hosted in the public cloud! It's not a cloud system, though.
They're legacy systems that you've basically containerized in a roundabout fashion. Um, I think about it like this as well: I sort of think about moving to the cloud, or deploying infrastructure in the cloud, as getting on a boat. That was my English accent, but that, that accent...
Pause to appreciate the word "boat" there for a second.
Also. Sorry, what?
Yeah. So if you think about, um, there's a little fishing boat or something, like a rowing boat, at the end of a dock. Somebody who is set up to appropriately use cloud infrastructure, or to begin deploying infrastructure and assets in the cloud, will just get in the boat and row away. But there are lots of older companies who are really sentimentally attached to all of their on-prem crap and really don't want to get on the boat. So they've got one foot on the dock and one foot on the boat, and they've un-moored it, and it's starting to drift away a little bit, and they're going to fall in the water.
I love that analogy. Allow me to try and make it technically coherent.
Please do that.
So the way that I think about it is, the first step to moving to the cloud is hosting in the cloud, right? So that would be virtual servers hosted on somebody else's infrastructure, maybe looking at things like IaaS, infrastructure as a service. But what I think of when somebody says a cloud system is really the extreme right-hand side of that: function as a service, so Azure Functions or AWS Lambda, that kind of thing, where we're looking at technologies like serverless. And there's a whole pathway there, right, from virtualized systems through to microservices, function as a service, serverless services, absolutely. But that's what I think of as cloud services. And I think the way to summarize that, if somebody is trying to follow along with what we're saying here, is: has it been moved to the cloud, or was it built for the cloud?
That's the real distinction to me. Yeah, absolutely. You mentioned tiny companies, though. So I guess there are two sides to this. I'll rant about tiny companies in a second, because I love startups, but in the context of Chaos Engineering, I guess the implicit question there is: is Chaos Engineering something that you build up to through scale? Because we have, through this discussion, repeatedly used the term maturity. Does an organization get to a certain age, a certain size, a certain number of employees before Chaos Engineering becomes important? Or is it something that you can do from day one?
No, I absolutely think you can do it from day one. I think if you have cloud native infrastructure, you are at an advantage, because your systems, dependencies, interactions and processes are all going to be mapped in a more predictable fashion. You're likely going to have a better understanding of your estate and how it functions. Whereas if you have some on-prem legacy infrastructure, and then you've got some cloud-deployed infrastructure as well, it is much more difficult to see or to predict how it's going to react. So I think a lot of newer companies that maybe don't have that maturity already are actually in a better position to run these sorts of experiments.
The way that it works is, we all know how virtual servers work, right? Get a computer, you cut it up into pieces; that's a virtual server. On the other side of things, we have serverless. That's magic, right?
It's just somebody else's hypervisor.
It's just like, I send code out into the cloud, some thing somewhere runs it, and then it gives me a response. That's it.
It's not really serverless. It's just not your server.
Somebody else's server. So, like, we're going to have to come back to serverless in the context of supplier security questionnaires, because I've recently had some significant pain, in particular with auditors who don't understand some of the weird stuff that we're doing with infrastructure. I'd like to talk about that another day, because I want to talk about the cool stuff we're doing with the infrastructure, because that's awesome, but also some of the questions we get asked from a security point of view are pretty crazy. But small companies. I think the thing here is exactly like we had a second ago with, what does cloud system mean? It can mean a lot of different things. Small companies can mean a lot of different things. A sole trader is a small company; a micro company where they're just small by nature, you know, they're doing something, they have no intention of scaling.
Maybe it's just a few people. You could think of something like a solicitors or an accountants, where they're just a company that is small and intends on staying small. And then you could also think about things like startups. We'll have to have a rant about startups at some point, because I think startup, for a lot of people, means any new company. I hear the term startup applied in so many different ways, where somebody will say, oh yeah, I'm working for this startup, and then they'll be like 10 years old, and it's like, is that really a startup? It's like, I'm working for this startup, they're on series H. And it's like, is tha-, they're on series F. Is that really a startup? Um, the reason I did that is, I don't want the internet to know how I pronounce the word H.
It can't be worse than how I pronounce boat.
Uh, you know, they're on series F, is that a startup? Or you might be talking to somebody about, um, unicorns, right, a startup that's valued at over a billion dollars. Is that a startup? So for me, the word startup is so poorly applied, because it somehow applies to everything from a micro company that is never going to scale to a billion-dollar company that's 10 years old. It doesn't make sense. So we'll have to, at some point, have an episode about startups and have that rant. But bringing us back to Chaos Engineering: yeah, my experience, you know, um, I run a startup, our systems are built for the cloud, and we run experiments all the time to test how things work. And in fact, I've had some conversations with you recently in terms of, have we built this thing well? Is there anything you can think of that we've missed? And like I said earlier in the show, my experience with that was really just turning things off and then seeing what happens.
I think an important thing to note, though, at this point is that that doesn't work if you don't have monitoring in place.
That was the pause. What I was trying to prepare for was somebody accidentally rm -rf-ing a server in production and then just emailing their boss: oh, we're doing Chaos Engineering today, has anybody told you? Server's on fire.
Yeah. This is not a get out of jail free card. You can't just break things and tell them it's Chaos Engineering.
Remember we had a call the other day and I blew a server up? Um, do you remember, I destroyed one of the databases. It was like, oops, database is gone. I'm just going to wait for two minutes.
You were surprisingly chill about that as well.
This is Chaos Engineering though! I was surprisingly chill about that because I've had that kind of failure before; we have tested for what happens if a database has a major outage, and I know that as a first step, auto-scaling will take over. So the way that would work in that context is, if I've done something bad to the database, stopped the database service, auto-scaling will pick up on that because its health metrics will go into a failure state, and then it will kill the database and spin it up from a known good copy. So yeah, it is that weird thing. I guess that's the whole journey that we're talking about here with Chaos Engineering, isn't it? You start with some testing, where almost the beginning of the testing journey is: is this functioning? Do I have some way of knowing that this is working?
So for example, if we push some new code to production, is that code just working? Have we broken anything? And then you go from that to like user acceptance testing and those kinds of things. It's like, not only is it functioning, but is it functioning in the way that the user is expecting? And then as you build up in this chaos journey, you do get to the point where you're like, oh, that thing blew up, but it's fine because we've tested it so many times. And we know that the system is just going to recover from there.
The really cool thing about the example that you just gave about the database is that in a typical or traditional company, maybe, um, in the business continuity plan, that recovery time or point objective might be several hours for a critical service, and it might be that they can tolerate up to 24 hours' worth of data loss as their RPO. But the health metrics monitoring for auto-scaling on your database would pick that up really quickly, which brings your RTO back down to a few minutes, just as long as it takes to pick up on that and then deploy a new database. So it really does change business continuity completely.
There's also just this ability to observe things happening. So even if you don't necessarily change the business's risk appetite in terms of where your RPO is, just knowing that it's happening is better, right? I guess we can talk a little bit about AWS game days in a second. Um, but it makes you feel more comfortable if you know what's happened and if you know what the state of the system is. I guess, again, there's a different maturity journey here: organizations on one side of the spectrum not even knowing what assets they've got, let alone if those assets are working correctly. Then building up from that, we know what everything is, we know where everything is. Then building up from that, you accidentally blow up a database and you hope that your auto scaling groups are working. And then building up from that: hey, if there is an outage, it's fine, but also we know it's happened and we've picked up on that pretty quickly. I mean, I'd love to know how many organizations' public-facing websites could just go down, and how long it would take for them to even notice.
I bet there's something on the internet that tracks that kind of thing. There's gotta be some stats for that.
There is. There are some companies that track that stuff, where that's a service they offer: they'll monitor systems for you. Also, "going down" depends on what you mean by that. And that's the whole thing with Chaos Engineering, right? We're not just talking about a full system outage; it could be a component-level failure. Hey, your public-facing website is up, but nobody can log in. That's an issue.
People come to this podcast to hear "that's an issue, your website's not working". I'll tell you something else that's been on my mind, because we're talking about monitoring. It's something I've been playing around with. It's not directly chaos engineering, but it's another thing to think about when you're working through different kinds of testing, from unit testing to integration testing to Chaos Engineering: what if somebody introduced a failure that would be difficult to detect in code, but that you'd really want to know about? I can give you an example of this.
What if somebody changed the color of the login button to match the background? So your test code comes in and checks: is there a login button? Yes. Does the login form work? Yes. Can you type in the input box? Yes. But users can't see the button, so nobody can log in. How would your organization detect that? This is brought to you by a recent code change that I made.
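As a rough sketch of how you might catch that class of failure automatically, here's a toy visibility check comparing the button's color to the background. The luminance formula is simplified and the threshold is made up; a real check would use the full WCAG contrast-ratio definition and read computed styles from the rendered page.

```python
# Hypothetical sketch of a UI check that plain functional tests miss:
# is the login button actually visible against its background?
# The luminance formula and threshold are simplified for illustration.

def luminance(rgb: tuple[int, int, int]) -> float:
    # Rough relative luminance (skips the full WCAG gamma correction).
    r, g, b = (c / 255 for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def is_visible(button_rgb, background_rgb, min_diff: float = 0.2) -> bool:
    return abs(luminance(button_rgb) - luminance(background_rgb)) >= min_diff

# Button accidentally restyled to match the background:
print(is_visible((255, 255, 255), (255, 255, 255)))  # False
print(is_visible((0, 90, 200), (255, 255, 255)))     # True
```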
Did you do that?
Uh, no, I didn't. That's actually a textbook example. But one of the things that I actually did do (I'm not going to go into the long story of how this occurred, and some people will no doubt contact us and be like, how on Earth did you do this?) was to change the z-index of the input fields so that they displayed correctly, but a user couldn't click them, because they were effectively behind the foreground div.
So all of our system tests passed, because the page loaded correctly, the server responded quickly, the login form was loaded, and you could navigate the page with the browser. If you were, for example, to use the tab key to select inputs, that would select them, but you couldn't click them. The automated testing that we have effectively drives a headless web browser through Selenium, so we can test not only that the page loads, but that it loads in a browser like the ones users are actually using, and that all these example user activities work correctly. Didn't work for that one, though. So yeah, you can get really, really far into testing. If people are interested in how we detect that kind of thing: logins per second. We have a metric that is logins per second, and when that falls off a cliff, you know something's broken in your login box. It doesn't really matter what it is; nobody's logging in anymore, something has gone wrong, and you might want to investigate that. So I guess that brings us back to metrics around Chaos Engineering. We've talked about RTO and RPO, but I presume there are other metrics that might be useful from an experiments point of view.
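A minimal sketch of that "falls off a cliff" alarm might look like the following; the window sizes, rates, and drop ratio are all invented for illustration.

```python
# Hypothetical sketch of the "logins per second fell off a cliff" alarm.
# Thresholds, windows, and traffic numbers are invented for illustration.

from statistics import mean

def login_rate_alarm(history: list[float], current: float, drop_ratio: float = 0.5) -> bool:
    """Alert when the current login rate falls below half the recent average.
    It doesn't matter *what* broke the login box; nobody logging in is the signal."""
    baseline = mean(history)
    return current < baseline * drop_ratio

recent = [42.0, 39.5, 41.2, 40.8, 43.1]  # logins/sec over the last few windows
print(login_rate_alarm(recent, 40.0))  # normal traffic -> False, no alarm
print(login_rate_alarm(recent, 2.0))   # off a cliff -> True, alarm
```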
Yeah, there really are. But the point that you just made is, I think, even more valid. It links back to another value add you can get out of Chaos Engineering: it identifies problems with your monitoring. So you can drive maturity in that space and improve thresholds, metrics, and tolerances, and fine-tune those things, which makes it easier to detect incidents, outages, and failures in future.
So we keep saying monitoring, and I gave an example of how we monitor our system. But in the context of broader cloud systems, what do you mean by monitoring? Do we have a constant ping running? Is it just "the server exists, it's all probably fine", or is there a little bit more to monitoring? How do you know your system's working?
[Obscure English Literature reference].
You have to explain that reference to somebody who doesn't watch TV.
That's Shakespeare! That's the Merchant of Venice. I don't watch TV. They didn't either.
Literature degrees. Honestly, that's something that's going to come up on the show at some point is that I have a real degree - Information Security.
I have a Mickey Mouse English degree.
What's, what's the full title of your degree?
English literature with French.
That's not even an English degree!
It's got French in the name!
But it was a minor. It was only a little bit of French.
Do you know what my minor was?
That's the correct way to pronounce that word for all the Americans whose brains just exploded - privacy. Uh, yeah. Information security with privacy. Oh man. Could you imagine doing like Information Security with French? That would be...
I think that'd be really fun actually, because then you could actually understand that French security conference that you go to every year.
Yeah, Hack in Paris, or Nuit du Hack. There's no H in French. I would do much better in French than I do in English, because I don't pronounce that letter very well. So yeah, from a monitoring point of view, I mentioned logins per second being one of the things that we track, but really there's a whole host of things that you should be tracking, right? I mentioned earlier an instance failing because its disk got full. Surely that should never happen, right? You should be monitoring those kinds of things; system health should be quite a broad thing that's monitored.
Yeah. Well, I think you need to, again, define what healthy looks like for your systems so that you can implement some baseline monitoring. Um, and then you can use these sorts of experiments to tune that.
So we've talked about experiments generally speaking, and I gave the example of occasionally turning production servers off to see what happens. I actually want to do a conference talk at some point where I get on stage to talk about Chaos Engineering and then turn some production systems off, as kind of a mic drop, case in point. You absolutely know that's not going to go well, just like...
The demo gods will not smile on you that day.
Bringing production down at the one time where it's hardest to fix, because I'm literally on a stage. I just think it's one of those practice-what-you-preach kind of things. If I'm going to talk about Chaos Engineering, I should be able to demonstrate that I can blow systems up and the systems are going to recover.
So there are a bunch of other experiments that you can do, which I think is what you were getting at. There are two projects that I can think of that have notable examples of experiments in this space. The first would be Netflix, who basically pioneered Chaos Engineering, and they wrote a tool suite called Simian Army. Can you say that, Holly?
Yep. That's right.
What's being got at there is the fact that I was today years old when I realized that the word is Simian and not Symbian. I don't know how that happened, but I presume it's because I was really into mobile devices when I was younger, and Symbian is a mobile operating system. I remember playing around with that a lot, certainly prior to iPhone applications and that kind of thing, looking at how those systems could be built and programming on them. And that word apparently got locked in my brain, and now any time anybody has talked about Simian Army, I have heard Symbian.
So yeah, Simian Army. There are a few notable functions in Simian Army, and the, I guess, most friendly would be Chaos Monkey, which pretty much will just shoot a server: it'll take down an EC2 instance. That's quite a small place to start, if you've got auto scaling turned on and...
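The Chaos Monkey idea itself is simple enough to sketch. A real implementation would call the cloud provider's API to terminate the instance; this toy version just picks a random victim from an in-memory fleet so it's runnable as-is, and the instance IDs are made up.

```python
# Hypothetical sketch of the Chaos Monkey idea: pick one instance at random
# and "shoot" it. Real tools call the cloud API (e.g. terminating an EC2
# instance); here the fleet is just an in-memory list, and the IDs are fake.

import random

def chaos_monkey(fleet: list[str], rng: random.Random) -> tuple[str, list[str]]:
    victim = rng.choice(fleet)
    survivors = [i for i in fleet if i != victim]
    return victim, survivors

fleet = ["i-0aaa", "i-0bbb", "i-0ccc"]
victim, survivors = chaos_monkey(fleet, random.Random(0))
print(victim, survivors)
# With auto scaling, the group would now launch a replacement;
# the experiment checks that recovery actually happens.
```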
One thing as well, mainly because of the recent experiments I've been running: we keep talking about a full outage, like an EC2 instance going down, and there's that, but there's also an EC2 instance failing to respond, or responding with latency. Components don't have to fail entirely; they can fail partially as well, right?
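Partial failure can be sketched the same way: rather than killing a dependency, wrap calls to it so some fraction are slowed down. The probability and delay here are invented for illustration.

```python
# Hypothetical sketch of partial-failure injection: the dependency stays "up"
# but some calls are made slow. Probability and delay are invented values.

import random
import time

def with_latency(func, rng: random.Random, p: float = 0.3, delay_s: float = 0.05):
    """Wrap a call so that roughly a fraction p of invocations are slowed down."""
    def wrapped(*args, **kwargs):
        if rng.random() < p:
            time.sleep(delay_s)  # inject latency, but still return a result
        return func(*args, **kwargs)
    return wrapped

slow_fetch = with_latency(lambda: "ok", random.Random(1))
print(slow_fetch())  # still returns "ok", just sometimes late
```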
Yeah. So, um, the LinkedIn project, we'll get into that a little bit more. And I guess the Netflix piece, because it's older, was a bit more rudimentary. At the time, the concept of deliberately breaking your production systems to check whether your engineers had architected something for resilience was pretty crazy. Can you imagine going to a board and saying, 'we want to break our customer-facing website just to see if our engineers have built it properly'? I can't imagine most companies being okay with that.
Yeah. I think, again, any time anybody talks about public cloud, my brain immediately goes, oh, we're talking about AWS, right? Because that's my bias, and that's what I build in. But I think a lot of people in organizations who test well and have bought into Chaos Engineering, or even other kinds of testing if not Chaos, would be very surprised at how many organizations are not. I can give you another example from another area that I work in. I recently read a statistic about how few companies perform penetration testing, and the statistic was incredibly surprising to me. It was delivered in two parts, and it's one of those where you read the first part and you're like, oh, that kind of makes sense. Then you read the second part.
And you're like, oh. What it said was: 52% of large companies perform penetration testing, and 13% of all companies perform penetration testing. So really what this statistic is telling you is that more than half of big companies do this, but when you account for small companies as well, not very many are doing it. And I think that's the thing with testing: there's a huge number of companies out there who aren't doing a great deal of testing, let alone the cool testing stuff that we're talking about. I'll give you a funny example from the pen testing side of things. There was a company who had a main internet connection into their office, and then they had a backup internet connection as well. The idea being, if the main link goes down, they've got a much slower, smaller pipe, but they've still got some connectivity.
So you can imagine a broadband-pipe, dial-up-pipe kind of thing, but they've got this backup plan. What we actually found through testing was that the backup line had no firewall. So the main line, this gigabit link, goes through the firewall into the main office, and then if that link fails, they've got just an Any-Any rule: clear internet access.
So that fails open.
It fails open! Absolutely. And the interesting thing about that was that they just never tested it. You can look at it through adversarial mechanisms: could somebody DoS the system in some way to cause it to fail over, and then use that to get security leverage? Yeah, that's fair enough. But in the context we're looking at, it's just: you've never actually tested this, have you? They just never looked into it. So I think what we're talking about here is that a lot of organizations aren't doing any testing, or certainly not at this level.
Yeah, absolutely. I think it comes back to the maturity conversation that we had earlier and it does take a particular kind of approach and mindset and definitely a culture that is open to absorbing and kind of accounting for failure.
And just being okay with it, isn't it? It's accepting the fact that we should test these things, and if something goes wrong, it's better to find out about it in a controlled environment than because production's down.
Well, absolutely. And that links into a couple of different things. Werner Vogels, the CTO of AWS, says that "Everything fails, all the time", and it's quite an often-repeated quote of his.
So jumping back to talking about Netflix and the Simian Army: we talked about Chaos Monkey, and you said that that was an easy way to get started. You kind of implied that that's just how they started and that's one way in, and we talked about that being instance failure. So, you know, testing an EC2 instance going down or something like that. What's bigger than that? What's the next step?
Um, Chaos Gorilla is the next step up from that, which simulates an entire availability zone failure. And then there's one above that called Chaos Kong, which would simulate a region outage. I'm not sure if Chaos Gorilla is still available on GitHub; Chaos Kong isn't. Obviously that would be incredibly destructive material in the wrong hands. You could accidentally destroy your entire company because the BC and DR planning wasn't there and the architecture wasn't resilient enough.
One thing to jump in with at that point: when we talk about maturity, one of the things this means for organizations, at whatever level we're talking about, whether it's security or resilience or whatever, is that the organization needs to define its risk appetite in that space. Because we're talking here about handling an instance going down, handling an availability zone going down, or handling a region going down. It might be the case that for some organizations, depending on scale, budget, and intentions, you might not want to test for everything. You might not want to test to the level of, hey, what happens if a nuclear bomb hits a data center? There is an argument that organizations should specify not only the RTO and RPO, but the risk appetite in terms of how much they're willing to handle.