Using the Netflix Simian Army: An Interview with Gareth Bowles


Gareth Bowles of Netflix talks about his upcoming presentation at STARWEST 2014, how the cloud is all about fault tolerance, how his time at several Silicon Valley start-ups prepared him for joining Netflix, and how his career turned him from a developer into a tester.


Cameron Philipp-Edmonds: Today we have Gareth Bowles, and he'll be speaking at STARWEST 2014, which is October 12 through October 17, and he is giving a presentation entitled "Release the Monkeys: Testing using the Netflix Simian Army." To start things off, Gareth, can you tell us a little bit about your role at Netflix?

Gareth Bowles: Sure, yeah. Thanks for having me. I work on the team at Netflix called Engineering Tools, where we're responsible for the productivity of our developers. We have about 700 engineers in total, and my team provides build and deployment automation, all the way from source control up to deployment to Amazon Web Services.

We try to make the build and deployment as transparent as possible for our developers so they can get on with what they're good at, which is developing new features for Netflix.

Cameron Philipp-Edmonds: Okay, cool. You talk about cloud, and your talk says that cloud is all about redundancy and fault tolerance. Out of all the things I've heard people say about cloud, I've never really heard anyone say it's all about fault tolerance. Normally people say the cloud is all about magical storage, making storage virtualization dreams come true, or that they really don't even know what the cloud actually is.

Why do you look at the cloud as a test in fault tolerance?

Gareth Bowles: Good question, yeah. Cloud is a very nebulous term, right? It means a lot of different things to different people. What I'm really getting at here is that designing your application to run in a cloud environment is all about fault tolerance because you can't assume that the platform is going to be 100% available. No matter how good your architecture is, you can't guarantee 100% uptime of your client service.

For instance, Amazon Web Services has had various well-publicized outages that took down major websites. All hardware is eventually going to fail, too, even if you're running on your own hardware, so you have to assume any component of your app can fail at any time, and then design your apps to handle the failure as transparently as you can so that your users are not impacted.

Cameron Philipp-Edmonds: You mentioned something pretty cool there. You mentioned that no component can guarantee 100% uptime. With that being said, what really is a reasonable uptime to be expected for a development team, and then also, for a consumer? Are those expectations the same?

Gareth Bowles: I don't think they are, no. To answer that question, it depends on what type of app you're developing, whether you have any legal requirements for uptime, for instance, and at the end of the day, what your customers' expectations are for uptime. In general, I'd expect a production system to have much higher uptime requirements than a development or test system.

Cameron Philipp-Edmonds: Like you said earlier, you work with the Netflix Simian Army. Can you tell us a little bit about what the philosophy is behind it?

Gareth Bowles: Building on what I was saying about designing for fault tolerance, the Simian Army is designed to make sure that our fault-tolerant architecture actually works by introducing different types of failure, but doing it in a controlled way. Rather than having to wake somebody up on a pager at 3 a.m. on a Sunday, we can do it while engineers are standing by to address any problems and run it in a scheduled way that enables us to learn from the problems we find and, in most cases, build automatic recovery mechanisms so that when we do get that failure at 3 a.m. on Sunday, nobody even notices it.

Cameron Philipp-Edmonds: Okay, and you covered the philosophy a little bit, can you briefly introduce the main members, main components, of Netflix's Simian Army?

Gareth Bowles: Sure, yes. Chaos Monkey was the one that got it all started. That monkey randomly disables AWS instances, which simulates one of the most common types of failure that you'll get in the cloud: an instance just goes away. To make sure that we can survive that common failure without any kind of customer impact, we used to run Chaos Monkey as a controlled experiment with engineers standing by to fix problems, but we're now comfortable enough that we make it the default for production, and teams have to explicitly opt out of Chaos Monkey if they don't want it to take down their instances in production.
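
As an illustration of the idea described above, here is a minimal sketch of an opt-out random instance killer. It is a simulation only and is not Netflix's implementation: the `Instance` and `ChaosMonkeySketch` names, the `group` field, and the `opted_out_groups` setting are all hypothetical, and no real AWS calls are made.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Instance:
    """Hypothetical stand-in for a cloud instance (not an AWS API object)."""
    instance_id: str
    group: str
    running: bool = True


@dataclass
class ChaosMonkeySketch:
    """Picks one running instance at random and disables it.

    Every group is eligible by default; a team must add its group to
    opted_out_groups to be skipped (assumed opt-out mechanism).
    """
    opted_out_groups: Set[str] = field(default_factory=set)
    rng: random.Random = field(default_factory=random.Random)

    def pick_victim(self, instances: List[Instance]) -> Optional[Instance]:
        # Only running instances whose group has not opted out are candidates.
        candidates = [
            i for i in instances
            if i.running and i.group not in self.opted_out_groups
        ]
        return self.rng.choice(candidates) if candidates else None

    def unleash(self, instances: List[Instance]) -> Optional[Instance]:
        # Simulate the failure by marking the chosen instance as not running.
        victim = self.pick_victim(instances)
        if victim is not None:
            victim.running = False
        return victim


if __name__ == "__main__":
    fleet = [Instance("i-1", "api"), Instance("i-2", "api"), Instance("i-3", "billing")]
    monkey = ChaosMonkeySketch(opted_out_groups={"billing"})
    print("disabled:", monkey.unleash(fleet))
```

In a real environment the interesting part is what happens next: whether the remaining instances absorb the traffic without customers noticing.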

Cameron Philipp-Edmonds: Okay. Why is it called Chaos Monkey?

Gareth Bowles: Really, because it's introducing chaos. Theoretically it's introducing chaos, but if you design your apps right and get your fault tolerance correct, then there won't be any chaos. I guess the idea is that we're testing for the absence of chaos. We also have Latency Monkey. That one introduces artificial delays in communications between services. Netflix runs on a distributed service architecture where we've got hundreds of little microservices all talking to each other to make up the Netflix streaming experience.

We can even make very large delays with Latency Monkey and simulate a complete service outage without actually bringing the instances down, so it's kind of an easy way to test a service outage.
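
The latency-injection idea can be sketched as follows. This is an illustrative example under stated assumptions, not how Latency Monkey is built: `with_injected_latency` and `call_with_fallback` are hypothetical helpers, and the timeout-plus-fallback behavior is one assumed way a calling service might tolerate a slow dependency.

```python
import concurrent.futures
import time


def with_injected_latency(func, delay_seconds):
    """Wrap func so that every call is artificially delayed first."""
    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)  # the injected delay
        return func(*args, **kwargs)
    return delayed


def call_with_fallback(func, timeout_seconds, fallback):
    """Call func in a worker thread; return fallback if it takes too long."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # From the caller's point of view, a long enough delay looks
        # just like an outage of the dependency.
        return fallback
    finally:
        # Do not block on the slow call; the worker finishes on its own.
        executor.shutdown(wait=False)


def recommendations():
    """Hypothetical downstream service call."""
    return ["personalized"]


if __name__ == "__main__":
    slow = with_injected_latency(recommendations, delay_seconds=0.3)
    print(call_with_fallback(recommendations, 1.0, ["popular"]))
    print(call_with_fallback(slow, 0.05, ["popular"]))
```

Raising the injected delay well past the caller's timeout is what lets a delay stand in for a full service outage while the instances themselves stay up.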

And we have Conformity Monkey. That one finds instances that don't adhere to best practices. For instance, if they're not a member of an autoscaling group in AWS, then it will shut them down in order to give the service the chance to relaunch them properly. We have Janitor Monkey. That one eliminates clutter and waste by removing unused resources, like unattached EBS volumes and AMIs unassociated with running instances. That keeps our costs down and makes our environment easier to navigate by eliminating pointless resources.
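
The cleanup rule described for Janitor Monkey can be sketched as a simple filter over an inventory. This is a hedged illustration only: the `Volume`, `Image`, and `Instance` types and the `find_unused` function are hypothetical, the inventory is in-memory rather than fetched from AWS, and the sketch only reports candidates instead of deleting anything.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Volume:
    volume_id: str
    attached_instance: Optional[str]  # None means the volume is unattached


@dataclass(frozen=True)
class Image:
    image_id: str


@dataclass(frozen=True)
class Instance:
    instance_id: str
    image_id: str
    running: bool


def find_unused(
    volumes: List[Volume], images: List[Image], instances: List[Instance]
) -> Tuple[List[str], List[str]]:
    """Return IDs of unattached volumes and of images no running instance uses."""
    images_in_use = {i.image_id for i in instances if i.running}
    unused_volumes = [v.volume_id for v in volumes if v.attached_instance is None]
    unused_images = [img.image_id for img in images if img.image_id not in images_in_use]
    return unused_volumes, unused_images


if __name__ == "__main__":
    volumes = [Volume("vol-1", "i-1"), Volume("vol-2", None)]
    images = [Image("ami-1"), Image("ami-2")]
    instances = [Instance("i-1", "ami-1", True), Instance("i-2", "ami-2", False)]
    print(find_unused(volumes, images, instances))
```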

Then we have some extensions of Chaos Monkey that we brought in fairly recently called Chaos Gorilla and Chaos Kong, so those are going up in scale. Chaos Gorilla simulates the outage of an entire availability zone in AWS, which has actually happened a couple of times. Chaos Kong goes one step further and simulates an entire region going out, a region being made up of multiple availability zones. We use those on a scheduled basis. We're not running those all the time in production. We test our ability to either automatically rebalance between the availability zones that are still there, or completely fail over to a different region, in the case of Chaos Kong.
