
Break and fix your AWS EKS microservices with Chaos Monkey and FIS to test, on purpose, how well your system withstands failure. In a production environment, downtime is something you cannot afford.
Chaos Engineering forces you to ask the hard question: What happens when things break?
This guide takes you from theory to practice, helping you simulate failure in both Kubernetes and AWS-managed services using Kube‑Monkey and AWS Fault Injection Simulator (FIS).
Prerequisites
To successfully break and fix your AWS EKS microservices with Chaos Monkey & FIS, make sure you have:
- An AWS Account with access to EKS, RDS, ElastiCache, MSK, ACM, KMS, Route 53, and FIS
- A running EKS Cluster (use eksctl create cluster or the AWS Console)
- Installed and configured tools: kubectl, helm, and aws-cli
- An IAM role with fault injection permissions (see below)
Step 1: Deploy Kube‑Monkey in Your EKS Cluster
Kube‑Monkey simulates Chaos Monkey-style random pod terminations in Kubernetes.
This helps test how well your microservices recover from unexpected crashes.
Install using Helm:
helm repo add kube-monkey https://asobti.github.io/kube-monkey/charts/repo
helm repo update
helm install kube-monkey kube-monkey/kube-monkey --namespace kube-system
Verify deployment:
kubectl get pods -n kube-system -l app=kube-monkey
Opt-in Services for Chaos Testing
To participate, deployments must be explicitly labelled:
metadata:
  labels:
    kube-monkey/enabled: "enabled"
    kube-monkey/mtbf: "1"
    kube-monkey/kill-mode: "fixed"
    kube-monkey/kill-value: "1"
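This config tells Kube‑Monkey to kill a fixed count of one pod each time it targets the deployment, with a mean time between failures (mtbf) of one day, so roughly one pod per day.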
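When many deployments opt in, the labels above can be generated rather than typed by hand. The following is a minimal sketch, not part of Kube‑Monkey itself: the function names and the identifier `checkout-api` are hypothetical. It also adds a `kube-monkey/identifier` label, which the Kube‑Monkey README lists as required alongside the four labels shown above; treat that, and the list of kill modes, as assumptions to verify against the version you install.

```python
import json

# Kill modes as listed in the Kube-Monkey README (assumption: verify for your version).
KILL_MODES = {"fixed", "kill-all", "random-max-percent", "fixed-percent"}


def kube_monkey_labels(identifier, mtbf_days=1, kill_mode="fixed", kill_value=1):
    """Return the opt-in labels. Values are strings because label values must be strings."""
    if kill_mode not in KILL_MODES:
        raise ValueError(f"unsupported kill mode: {kill_mode}")
    return {
        "kube-monkey/enabled": "enabled",
        "kube-monkey/identifier": identifier,  # hypothetical value chosen by you
        "kube-monkey/mtbf": str(mtbf_days),
        "kube-monkey/kill-mode": kill_mode,
        "kube-monkey/kill-value": str(kill_value),
    }


def deployment_patch(identifier, **kwargs):
    """Build a merge patch that labels both the Deployment and its pod template."""
    labels = kube_monkey_labels(identifier, **kwargs)
    return {
        "metadata": {"labels": labels},
        "spec": {"template": {"metadata": {"labels": labels}}},
    }


if __name__ == "__main__":
    print(json.dumps(deployment_patch("checkout-api"), indent=2))
```

Saving the printed JSON to a file and applying it with kubectl patch deployment (using --type merge) is one way to opt a service in. Note that changing pod template labels triggers a rollout of the deployment.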
Step 2: Set Up IAM Role for AWS FIS
AWS FIS allows you to simulate real faults in AWS services. First, create a role (AWSFISExperimentRole) with permissions like:
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["rds:RebootDBInstance"], "Resource": "*"},
    {"Effect": "Allow", "Action": ["elasticache:TestFailover"], "Resource": "*"},
    {"Effect": "Allow", "Action": ["kafka:RebootBroker"], "Resource": "*"},
    {"Effect": "Allow", "Action": ["acm:UpdateCertificateOptions"], "Resource": "*"},
    {"Effect": "Allow", "Action": ["kms:DisableKey"], "Resource": "*"},
    {"Effect": "Allow", "Action": ["route53:ChangeResourceRecordSets"], "Resource": "*"}
  ]
}
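The permissions policy above says what the role may do, but FIS also has to be allowed to assume the role. That is expressed in a separate trust policy naming the fis.amazonaws.com service principal. Here is a minimal sketch that builds that document; the function name is hypothetical, and you may want to add condition keys to narrow the trust further.

```python
import json


def fis_trust_policy():
    """Trust policy that lets the AWS FIS service assume the experiment role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Print the document so it can be redirected to a file for the AWS CLI.
    print(json.dumps(fis_trust_policy(), indent=2))
```

The printed document can be saved to a file and passed to aws iam create-role through --assume-role-policy-document when creating AWSFISExperimentRole.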
Step 3: Create FIS Experiment Templates
Use AWS FIS to simulate service disruptions across key infrastructure components.
Examples:
- Reboot RDS instances to simulate DB crashes
- Trigger ElastiCache failovers to test the caching layer stability
- Restart MSK brokers for message queue resilience
- Disable KMS keys or change ACM certificate options to test security and access controls
- Delete Route 53 records to validate DNS failure fallback
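Each experiment is defined in a JSON template. Not every fault listed here necessarily has a native FIS action, so check the FIS actions reference linked under Further Reading before writing a template. Register the template with:
aws fis create-experiment-template --cli-input-json file://
Creating a template does not run it. To launch an experiment, call aws fis start-experiment with the template ID that the first command returns.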
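As an illustration of what such a template contains, the sketch below builds one for the first example, rebooting an RDS instance. It assumes the native FIS action aws:rds:reboot-db-instances with its DBInstances target; the function name, the target and action keys ("db", "reboot"), and the ARNs in the example are hypothetical placeholders to replace with your own.

```python
import json


def rds_reboot_template(role_arn, db_instance_arn, alarm_arn=None):
    """Build an FIS experiment template that reboots a single RDS instance."""
    if alarm_arn:
        # Stop the experiment automatically if this CloudWatch alarm fires.
        stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
    else:
        stop_conditions = [{"source": "none"}]
    return {
        "description": "Reboot one RDS instance to simulate a DB crash",
        "roleArn": role_arn,
        "targets": {
            "db": {
                "resourceType": "aws:rds:db",
                "resourceArns": [db_instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "reboot": {
                "actionId": "aws:rds:reboot-db-instances",
                # Maps the action's DBInstances target to the "db" target above.
                "targets": {"DBInstances": "db"},
            }
        },
        "stopConditions": stop_conditions,
        "tags": {"Name": "rds-reboot"},
    }


if __name__ == "__main__":
    template = rds_reboot_template(
        "arn:aws:iam::123456789012:role/AWSFISExperimentRole",  # placeholder account
        "arn:aws:rds:us-east-1:123456789012:db:example-db",  # placeholder instance
    )
    print(json.dumps(template, indent=2))
```

Passing a CloudWatch alarm ARN as the stop condition is generally the safer choice outside a sandbox, since it gives FIS a way to halt the experiment when the blast radius grows beyond what you intended.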
Step 4: Monitor & Analyse Results
Monitoring Tools:
- kubectl get pods --watch
- CloudWatch Dashboards for latency, throughput, and alarm triggers
- EFK stack or CloudWatch Logs for service-level diagnostics
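Watching pods by eye works for a single experiment, but a scripted check is easier to repeat. The sketch below takes the JSON that kubectl get pods -o json produces and lists pods whose Ready condition is not True. The function name and the sample data are hypothetical; the field names follow the standard Kubernetes pod status layout.

```python
def not_ready_pods(pod_list):
    """Return the names of pods whose Ready condition is missing or not "True"."""
    names = []
    for pod in pod_list.get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(pod["metadata"]["name"])
    return names


# Hand-written sample mimicking the shape of `kubectl get pods -o json`.
sample = {
    "items": [
        {"metadata": {"name": "api-1"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "api-2"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]
}

if __name__ == "__main__":
    print(not_ready_pods(sample))
```

In practice you would load the real kubectl output with json.load and call the function in a loop during the experiment, recording when the list becomes empty again.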
Key Metrics:
- MTTD – Mean Time to Detect
- MTTR – Mean Time to Recover
- Error rate and latency trends
These insights help you break and fix your AWS EKS microservices effectively, ensuring they recover gracefully from failure.
Step 5: Automate Your Chaos Workflow
Chaos shouldn’t be a one-time stunt. Automate it via:
Tool | Use Case |
---|---|
CI/CD | Inject chaos after deployment |
EventBridge | Schedule weekly disruptions |
CloudWatch Alarms + SNS | Notify teams via Slack or email |
Chaos automation ensures your systems are continuously prepared, not just during testing cycles.
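For the CI/CD row, one possible approach is a gate step that inspects the experiment's final state and fails the pipeline if the experiment did not complete cleanly. The sketch below assumes the response shape of aws fis get-experiment, where the status sits under experiment.state.status; the function name and the exit code convention are hypothetical.

```python
# Statuses that mean the experiment has not finished yet.
IN_PROGRESS = {"pending", "initiating", "running", "stopping"}


def gate(experiment_response):
    """Map an FIS experiment state to a CI exit code.

    0 = completed, 2 = still in progress, 1 = anything else
    (for example stopped by a stop condition, or failed).
    """
    status = experiment_response["experiment"]["state"]["status"]
    if status == "completed":
        return 0
    if status in IN_PROGRESS:
        return 2
    return 1


if __name__ == "__main__":
    # Hand-written sample mimicking `aws fis get-experiment` output.
    sample = {"experiment": {"id": "EXAMPLE", "state": {"status": "stopped"}}}
    print(gate(sample))  # 1
```

A pipeline step could poll until the code is no longer 2, then pass or fail the stage on the result. Treating every unrecognised status as a failure keeps the gate conservative.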
Why You Should Break and Fix Your AWS EKS Microservices with Chaos Monkey & FIS
- Reveal hidden single points of failure
- Improve service durability and uptime
- Build engineering confidence
- Reduce incident response time
- Strengthen observability and monitoring practices
Final Thought
To break and fix your AWS EKS microservices with Chaos Monkey & FIS is to truly prepare them for the real world.
Every system fails eventually—what matters is how quickly it recovers. With the right tools, strategy, and mindset, chaos becomes not a threat but a training ground for resilience.
Further Reading
- Kube‑Monkey GitHub: https://github.com/asobti/kube-monkey
- AWS FIS Docs: https://docs.aws.amazon.com/fis/latest/userguide/
- FIS RDS Action: https://docs.aws.amazon.com/fis/latest/APIReference/actions-supported.html#rds-actions
- FIS ElastiCache: https://docs.aws.amazon.com/fis/latest/APIReference/actions-supported.html#elasticache-actions
- FIS MSK: https://docs.aws.amazon.com/fis/latest/APIReference/actions-supported.html#kafka-actions
- FIS ACM & KMS: https://docs.aws.amazon.com/fis/latest/APIReference/actions-supported.html#acm-kms-actions
- FIS Route 53: https://docs.aws.amazon.com/fis/latest/APIReference/actions-supported.html#route53-actions
In today’s cloud-native environment, your ability to break and fix your AWS EKS microservices with Chaos Monkey & FIS could be the difference between downtime and resilience.