Simulating outages for local cloud apps with LocalStack Chaos API
LocalStack Chaos API enables you to simulate outages in any AWS region or service. This blog guides you through setting up a cloud application locally and using the Chaos API to mimic service failures, helping ensure your application handles disruptions smoothly. Additionally, it discusses how to write chaos tests, providing a detailed strategy for testing resilient systems.
Introduction
Regardless of the precautions you take in developing your application and the reliability of cloud providers like AWS, incidents are unavoidable. They can range from a large-scale regional outage to a single request failing for unknown reasons. Common examples include:
- Region-wide outages
- DNS failovers
- Service failures
- Network faults
The key is to build your application so that it handles these situations gracefully. Chaos engineering is an approach for testing these types of scenarios that can help you build a more resilient application.
Introducing the LocalStack Chaos API
LocalStack’s brand-new Chaos API provides an easy way to run chaos engineering experiments that test a wide variety of simulated outages and failures within your application safely, without impacting your production users. All the testing scenarios described above can be executed within LocalStack, providing thorough coverage for critical situations in a matter of minutes rather than hours or days.
This blog will walk you through the process of setting up a cloud application on your local machine and using the Chaos API to simulate service failures in a local environment, with robust error handling to address and mitigate such issues. Furthermore, we will explore how to shift-left your chaos testing by integrating automated testing directly into your continuous integration workflow.
Prerequisites
- LocalStack Docker image & LOCALSTACK_AUTH_TOKEN
- Docker Compose
- AWS CLI & awslocal wrapper
- Maven 3.8.5 & Java 17
- Python & pytest framework
- cURL
Product Management System with Lambda, API Gateway, and DynamoDB
This demo sets up an HTTP CRUD API functioning as a Product Management System. The components deployed include:
- A DynamoDB table named Products.
- Three Lambda functions:
  - add-product for product addition.
  - get-product for retrieving a product.
- A locally hosted REST API named product-api-gateway.
- API Gateway resource named productApi with additional GET and POST methods.
All resources can be deployed using a LocalStack Init Hook via the init-resources.sh script in the repository.
To begin, clone the repository on your local machine:
Let’s create a Docker Compose configuration for simulating a local outage in the running Product Management System.
Set Up the Docker Compose
To start LocalStack and use the Chaos API, create a new Docker Compose configuration. You can find the official Docker Compose file for starting the LocalStack container in our documentation.
For an extended setup, include the following in your Docker Compose file:
- Include the LOCALSTACK_HOST=localstack environment variable to ensure LocalStack services are accessible from other containers.
- Create the ls_network network to use LocalStack as its DNS server and enable the resolution of the domain name to the LocalStack container (also specify it via the LAMBDA_DOCKER_NETWORK environment variable).
- Add a new volume attached to the LocalStack container. This volume holds the init-resources.sh file, which is copied to the LocalStack container and executed when the container is ready.
- Add another volume to copy the built Lambda functions specified as ZIP files during Lambda function creation.
- Optionally, add the LAMBDA_RUNTIME_ENVIRONMENT_TIMEOUT environment variable to wait for the runtime environment to start up, which may vary in speed based on your local machine.
The final Docker Compose configuration is as follows (also provided in the repository):
Deploy the local AWS infrastructure
Before deploying the demo application locally, build the Lambda functions to ensure they can be copied over during Docker Compose startup. Execute the following command:
The built Lambda function is now available at lambda-functions/target/product-lambda.jar.
Start the Docker Compose configuration, which automatically creates the local deployment using AWS CLI and the awslocal script inside the LocalStack container:
Check the Docker Compose logs to verify that LocalStack is running and your local AWS infrastructure is set up correctly. You should see the following output:
After deployment, use cURL to create a product entity.
Execute the following command:
The output should be:
You can verify the successful addition by scanning the DynamoDB table:
The output should be:
Injecting Chaos in the local infrastructure
You can now use the Chaos API for chaos testing of your locally deployed infrastructure. The Chaos API is exposed as a REST API at http://localhost.localstack.cloud:4566/_localstack/chaos/faults and accepts standard HTTP requests.
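As an illustration of what such requests might look like from Python, here is a minimal sketch that uses only the standard library. The endpoint is the one quoted above; the payload shape (a JSON list of rules with "service" and "region" keys) and the helper names build_fault_rule, build_request, and send are assumptions made for this example, so check them against the Chaos API documentation before relying on them.

```python
import json
import urllib.request

# Endpoint quoted in the text above.
CHAOS_FAULTS_URL = "http://localhost.localstack.cloud:4566/_localstack/chaos/faults"


def build_fault_rule(service, region):
    """Return one fault rule; the "service"/"region" keys are an assumed payload shape."""
    return {"service": service, "region": region}


def build_request(method, rules=None, url=CHAOS_FAULTS_URL):
    """Build (but do not send) an HTTP request for the faults endpoint."""
    data = None if rules is None else json.dumps(rules).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def send(request, timeout=10):
    """Send the request and return (HTTP status, response body as text)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires a running LocalStack container with the Chaos API available.
    rule = build_fault_rule("dynamodb", "us-east-1")
    print(send(build_request("POST", [rule])))  # start the outage
    print(send(build_request("GET")))           # inspect the active configuration
```

Keeping request construction separate from sending makes the payload easy to inspect or unit test without a running container.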
To create an outage that takes down DynamoDB in the us-east-1 region, execute the following command:
The output should be:
This command creates an outage in the locally mocked us-east-1 DynamoDB tables. Verify by scanning the Products table:
The output should be:
You can verify it in the LocalStack logs:
You can retrieve the current outage configuration using the following GET request:
The output should be:
Error handling for the local outage
Now that the experiment has started, the DynamoDB table is inaccessible, so users cannot retrieve or create products.
The API Gateway will return an Internal Server Error. To prevent this, include proper error handling and a mechanism to prevent data loss during a database outage.
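The general pattern can be sketched in a few lines of Python. This is not the demo's Lambda code (which is built with Maven and Java); it is a hypothetical illustration in which write stands for the primary database write and enqueue stands for a fallback channel that keeps the data until the database is back. The 202 status and the message texts are illustrative choices, not values taken from the demo.

```python
def save_product(product, write, enqueue):
    """Try the primary write; on failure hand the product to a fallback channel.

    `write` and `enqueue` are hypothetical callables supplied by the caller,
    for example a DynamoDB PutItem wrapper and a topic or queue publisher.
    """
    try:
        write(product)
    except Exception:  # in real code, catch the SDK's specific error types
        # The database is unavailable: keep the data instead of failing the request.
        enqueue(product)
        return {"statusCode": 202, "body": "Database unavailable; product queued."}
    return {"statusCode": 200, "body": "Product saved."}
```

Returning a distinct status for the queued case lets callers tell an immediate write from a deferred one, while the client is still spared an Internal Server Error.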
You can add a solution that includes an SNS topic, an SQS queue, and a Lambda function that picks up queued elements and retries the PutItem operation on the DynamoDB table.
If DynamoDB is still unavailable, the item will be re-queued. The solution includes:
- A process-product-events Lambda for event processing and DynamoDB writes.
- SNS topic named ProductEventsTopic and SQS queue named ProductEventsQueue.
- Subscription between the SQS queue and the SNS topic.
- Event source mapping between the SQS queue and the process-product-events Lambda function.
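The retry behavior of the components above can be sketched as follows. This is a hypothetical Python rendering of the idea, not the demo's Java handler. It assumes the default SNS-to-SQS delivery format, in which the SQS message body is a JSON envelope whose "Message" field carries the published payload, and it takes the actual write as an injected callable named write.

```python
import json


def extract_product(record):
    """Pull the product out of one SQS record.

    Assumes default (non-raw) SNS delivery: the SQS body is a JSON envelope
    whose "Message" field holds the published payload as a JSON string.
    """
    envelope = json.loads(record["body"])
    return json.loads(envelope["Message"])


def handle_batch(event, write):
    """Write every queued product and return how many were written.

    `write` is a hypothetical callable wrapping the DynamoDB PutItem call.
    Any exception it raises is left to propagate: a failed invocation makes
    the SQS messages visible again later, so they are retried.
    """
    written = 0
    for record in event["Records"]:
        write(extract_product(record))
        written += 1
    return written
```

Because a failed invocation returns the whole batch to the queue, products written before the failure are written again on retry. That is harmless here as long as the write simply overwrites the item with the same key.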
Run the following commands to create the necessary resources:
Test the solution by executing the following command:
The output should be:
To stop the outage, send a POST request with an empty list as the configuration.
The following request will clear the current configuration:
Now, scan the DynamoDB table and verify that the Super Widget item has been inserted:
The output should be:
Automating Chaos Experiments using Pytest
You can now implement a straightforward chaos test using pytest to start an outage.
The test will:
- Validate the availability of Lambda functions and the DynamoDB table.
- Start a local outage and verify if DynamoDB API calls throw an error.
- Validate the ongoing outage and its appropriate cessation.
- Query the DynamoDB table for new items and assert their presence.
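The steps above can be sketched as a small, dependency-injected flow. This is a hypothetical outline rather than the repository's test file: scan_ids and add_product are placeholder callables (in a real test they might wrap a boto3 DynamoDB scan and an HTTP call to the product API), and the fault payload shape is an assumption, as is the helper naming.

```python
import json
import time
import urllib.request
from contextlib import contextmanager

CHAOS_FAULTS_URL = "http://localhost.localstack.cloud:4566/_localstack/chaos/faults"


def post_faults(rules, url=CHAOS_FAULTS_URL):
    """POST a fault list to the Chaos API; an empty list clears the outage."""
    request = urllib.request.Request(
        url,
        data=json.dumps(rules).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


@contextmanager
def dynamodb_outage(region="us-east-1", send=post_faults):
    """Start a DynamoDB outage and always clear it on exit, even if a check fails."""
    send([{"service": "dynamodb", "region": region}])  # assumed payload shape
    try:
        yield
    finally:
        send([])


def wait_until(predicate, timeout=30.0, interval=1.0):
    """Poll `predicate` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()


def run_outage_check(scan_ids, add_product, product_id, send=post_faults):
    """Run the four steps; `scan_ids` and `add_product` are hypothetical callables."""
    scan_ids()  # 1. smoke check: the table answers before the experiment
    with dynamodb_outage(send=send):
        try:
            scan_ids()  # 2. direct calls should fail while the outage is active
        except Exception:
            failed_during_outage = True
        else:
            failed_during_outage = False
        add_product(product_id)  # sent during the outage; expected to be queued
    assert failed_during_outage, "scan unexpectedly succeeded during the outage"
    # 3 and 4. the outage is cleared; wait for the queued item to be written
    assert wait_until(lambda: product_id in scan_ids()), "item never appeared"
```

Wrapping the outage in a context manager means the fault configuration is cleared even when an assertion inside the block fails, so one failing test does not leave later tests running against a broken DynamoDB.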
For integration testing, you can use boto3 and the pytest framework.
In a new directory named tests, create a file named test_chaos.py.
Add the necessary imports and pytest fixtures:
Add the following code to perform a simple smoke test ensuring the availability of Lambda functions and the DynamoDB table:
Next, add the following helper functions to start, check, and stop the DynamoDB outage:
Now, add the following code to chaos test the locally deployed DynamoDB table:
Run the test locally using the following command:
The output should be:
Conclusion
We’ve seen how LocalStack’s Chaos API let us quickly simulate a service outage by hand, test our application’s response, and then adjust it to handle this type of incident gracefully rather than returning an error to the end user. Using tools like pytest, we can also use this API in automated tests that help ensure future updates to our application remain resilient to failures and outages.
In upcoming blog posts, we’ll demonstrate more complex chaos testing scenarios, such as performing RDS & Route53 failovers, injecting network latency into every API call, and using AWS Resilience Testing Tools such as AWS Fault Injection Service (FIS) locally. Stay tuned for more blogs on how LocalStack is enhancing your chaos engineering experience!
You can find the code in this GitHub repository.