Stop FIS’ing about your outages — be prepared with AWS Fault Injection Simulator
The cloud has evolved to provide an ever increasing array of services for us to build our applications from. It’s not uncommon now to have applications built from a plethora of Hyperscaler native services (e.g. API Gateway, Lambda, EventBridge, DynamoDB), or Hyperscaler agnostic cloud native services in containers running on Kubernetes clusters. These components can be spread across multiple availability zones (AZs) within a region, or even across multiple regions to serve users at lower latency or for ultra high availability. Even where our application is running on virtual machines (VMs) we have the option to trade off the known availability of that VM (e.g. EC2 standard instances) for unknown availability at a reduced cost (e.g. EC2 Spot Instances).
We have infrastructure as code (IaC) to deploy components consistently across multiple AZs or regions. But these geographic placement options bring other considerations: some services are regional by nature (e.g. SNS, EventBridge, S3), whilst some are AZ based (e.g. EC2; even RDS has an active node in one AZ whilst a passive node can be in another). Being able to test failure scenarios and relate them back to your desired Recovery Point Objective (RPO) and Recovery Time Objective (RTO) can be a challenge.
Chaos Engineering was designed to help test applications, allowing you to observe how they react to failure scenarios. Netflix were a pioneer in this area, developing ‘Chaos Monkey’ to simulate local failures like the loss of individual instances, and later expanding this with ‘Chaos Gorilla’ to simulate the loss of entire AZs and ‘Chaos Kong’ to simulate the loss of entire regions. I’m sure we’ve all worked on teams that had a human equivalent of the Chaos Monkey, Gorilla or Kong, who provided the capability as a by-product of their actual day jobs! Clearly being able to test failure, observe behaviour, and react and improve resilience before that event happens in the real world is highly desirable. For AWS customers having this available ‘as a service’ is even better… enter AWS Fault Injection Simulator (FIS).
The scenario I wanted to test involves EC2 Spot Instances. I’ve really got into competing in AWS DeepRacer, which for serious use without breaking the bank really requires you to go outside of the AWS DeepRacer Console experience (subject of another blog post here). To get the biggest ‘bang for my buck’ in training I needed to use EC2 Spot Instances. JP Morgan Chase’s team have developed an excellent open source repo for training (called DeepRacer On The Spot), which itself is a wrapper around the legend that is Lars Lorentz Ludvigsen’s Deepracer For Cloud repo. However, the minimum viable product (MVP) version they released, and which I’d been using for a few months, had one major drawback for Spot Instance training: when the Spot Instance was terminated earlier than planned (e.g. because AWS reclaimed the capacity or the price went above a level I was willing to pay) the training didn’t later resume. This could result in hours of ‘lost’ training time if the interruption happened and I didn’t notice due to being asleep, being busy at weekends, etc.
I decided to contribute to this project by improving the architecture, moving the single EC2 Spot Instance into an Auto Scaling Group (ASG). The architecture requires that a single EC2 Spot Instance is running (i.e. maximum and desired number of instances set to 1). Without going into the detail of this particular application, it wasn’t sufficient to simply start the next instance when the previous one was terminated. This would result in the training restarting from the beginning, potentially losing hours or days of valuable training time and wasting money in the process. Upon termination I needed the terminating instance to prepare the configuration files so that the next instance to start resumed the training from the point where the terminating one finished, regardless of whether that happened 30 minutes or two days into the training. I couldn’t simply wait an undetermined amount of time for an EC2 Spot interruption to happen, and simply terminating the instance in the console doesn’t provide the exact same behaviour: termination starts immediately, whereas a Spot interruption gives you a two-minute warning to get your house in order. Although this isn’t a highly complicated requirement, it’s a perfect use case for FIS.
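To give a flavour of what has to happen inside that two-minute window, here’s a minimal sketch of one common way an instance can detect the interruption notice for itself, by polling the instance metadata service (IMDSv2). The prepare_for_next_instance() function is hypothetical and simply stands in for whatever saves the training state for the next ASG instance; the actual DeepRacer On The Spot code may detect and handle the warning differently.

```python
# Minimal sketch: watch IMDSv2 for the two-minute Spot interruption warning,
# then checkpoint so the next Auto Scaling Group instance can resume training.
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def imds_token() -> str:
    # IMDSv2 requires a session token before reading metadata
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()


def interruption_pending(token: str) -> bool:
    # /spot/instance-action returns 404 until AWS issues the two-minute notice
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=2):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def prepare_for_next_instance() -> None:
    # Hypothetical placeholder: e.g. copy checkpoints/config to S3 so the
    # replacement instance resumes training rather than starting over.
    pass


if __name__ == "__main__":
    token = imds_token()  # simplified: a long-running loop should refresh this
    while not interruption_pending(token):
        time.sleep(5)
    prepare_for_next_instance()
```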
Although FIS is supported by the CLI and IaC tools like Terraform and CloudFormation, I chose to use the console. It was a nice, easy experience and I wanted to learn more about the service. Of course if you have highly complicated test scenarios, or you want to run them across multiple regions, then you’re going to want to create these in code so that they’re repeatable, for example as part of a CI/CD test suite.
Initiating the interruption was simple:
With the option to use a built-in service role or choose your own:
The test would then run and confirm the results:
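For repeatability, the same experiment can be expressed as code. Here’s a rough boto3 sketch of what that might look like; the role ARN, tag values and names are illustrative placeholders, and the FIS role needs permission to send Spot interruptions to the targeted instance.

```python
# Rough sketch: define and start an FIS experiment that sends a Spot
# interruption (with the standard two-minute warning) to one tagged instance.
import uuid

import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Send a Spot interruption to the training instance",
    roleArn="arn:aws:iam::123456789012:role/my-fis-role",  # illustrative
    stopConditions=[{"source": "none"}],  # no CloudWatch alarm guard for this simple test
    targets={
        "oneSpotInstance": {
            "resourceType": "aws:ec2:spot-instance",
            "resourceTags": {"Name": "deepracer-training"},  # illustrative tag
            "selectionMode": "COUNT(1)",
        }
    },
    actions={
        "interruptSpot": {
            "actionId": "aws:ec2:send-spot-instance-interruptions",
            "parameters": {"durationBeforeInterruption": "PT2M"},
            "targets": {"SpotInstances": "oneSpotInstance"},
        }
    },
)

fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId=template["experimentTemplate"]["id"],
)
```

Each run of start_experiment gives the targeted instance the same two-minute notice it would receive in a real interruption, which is exactly what makes repeated testing so easy.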
By using this multiple times I was able to accurately test the actual scenario that would be faced during a real EC2 Spot Interruption. This allowed me to test the EventBridge configuration that was listening for the events, along with the application’s reaction. I could test what happened when multiple interruptions occurred during the life of the ASG, and could tune the code to continue regardless of there being multiple interruptions. All of this could be achieved over a few hours of testing, iterating code and retesting, rather than risking unintended consequences or waiting for these less frequent events to occur naturally to find out whether my design behaved as desired.
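For context, an EventBridge rule that matches the Spot interruption warning event looks something like the sketch below. The rule name and Lambda target ARN are placeholders, and in practice the Lambda function also needs a resource policy allowing EventBridge to invoke it; this is only to illustrate the kind of configuration the experiments let me exercise.

```python
# Sketch: an EventBridge rule matching the EC2 Spot interruption warning,
# routed to a (placeholder) Lambda function that handles the handover.
import json

import boto3

events = boto3.client("events")

events.put_rule(
    Name="spot-interruption-warning",  # placeholder rule name
    EventPattern=json.dumps({
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Spot Instance Interruption Warning"],
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="spot-interruption-warning",
    Targets=[{
        "Id": "handle-interruption",
        # Placeholder ARN for the function that prepares the next instance
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:handle-interruption",
    }],
)
```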
I found FIS very useful, even for my simple use case, and you can build up much more complicated test scenarios. At the time of writing (April 2023) there are 28 different AWS fault actions you can use to simulate a failure. These range from binary events such as EC2 Spot interruptions, terminations and reboots, to degradation scenarios like reducing ECS capacity, simulating API errors or throttling, I/O stress testing, networking interruptions and more. You can also use it as a framework to test whatever you want on instances and containers, by injecting your own code via SSM documents or custom resources into EKS. Subnet availability can be tested, although from the documentation it appears this simply denies traffic to the subnets. If that’s the case it will likely have the same limitations of not failing over Transit Gateway Attachments, FSx clusters and the like that I described when doing an SAP disaster recovery test in a previous blog. I’d need to find time to test that again…
The service is priced at $0.10 per action-minute, or $0.12 per action-minute in GovCloud, so a simple single-action experiment like mine that runs for a few minutes costs well under a dollar. This seems like ‘value based pricing’ to me given the cost avoided by not having to recover from a real outage that could take time to investigate and diagnose, or by being able to maintain revenue instead of losing bookings, sales, etc.