Break and Fix Your AWS EKS Microservices with Chaos Monkey & FIS: 5 Proven Techniques

Break and Fix Your AWS EKS Microservices with Chaos Monkey & FIS: A Practical Guide to Resilient Cloud Systems

This guide shows how to break and fix your AWS EKS microservices with Chaos Monkey and FIS, testing your system’s ability to withstand failure on purpose. In a production environment, downtime is a cost you cannot afford.

Chaos Engineering forces you to ask the hard question: What happens when things break?

This guide takes you from theory to practice, helping you simulate failure in both Kubernetes and AWS-managed services using Kube‑Monkey and AWS Fault Injection Simulator (FIS).

Prerequisites

To successfully break and fix your AWS EKS microservices with Chaos Monkey & FIS, make sure you have:

  • An AWS Account with access to EKS, RDS, ElastiCache, MSK, ACM, KMS, Route 53, and FIS
  • A running EKS Cluster (use eksctl create cluster or the AWS Console)
  • Installed and configured tools: kubectl, helm, and aws-cli
  • An IAM role with fault injection permissions (see below)

Step 1: Deploy Kube‑Monkey in Your EKS Cluster

Kube‑Monkey simulates Chaos Monkey-style random pod terminations in Kubernetes.

This helps test how well your microservices recover from unexpected crashes.

Install using Helm:

helm repo add kube-monkey https://asobti.github.io/kube-monkey/charts/repo
helm repo update
helm install kube-monkey kube-monkey/kube-monkey --namespace kube-system

Verify deployment:

kubectl get pods -n kube-system -l app=kube-monkey

Opt-in Services for Chaos Testing

To participate, deployments must be explicitly labelled:

metadata:
  labels:
    kube-monkey/enabled: "enabled"
    kube-monkey/mtbf: "1"
    kube-monkey/kill-mode: "fixed"
    kube-monkey/kill-value: "1"

With an MTBF of 1 and a fixed kill value of 1, this config tells Kube‑Monkey to terminate one pod of the deployment roughly once per day.
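Before relying on Kube‑Monkey to pick up a deployment, it can help to check the labels programmatically. The following is a minimal Python sketch that assumes only the four labels shown above; the function name and the sample values are hypothetical, and Kube‑Monkey’s own documentation remains the authority on which labels it expects.

```python
# The four opt-in labels shown in this guide.
REQUIRED_LABELS = (
    "kube-monkey/enabled",
    "kube-monkey/mtbf",
    "kube-monkey/kill-mode",
    "kube-monkey/kill-value",
)


def kube_monkey_label_problems(labels):
    """Return a list of problems; an empty list means the labels look complete."""
    problems = [f"missing label: {name}" for name in REQUIRED_LABELS if name not in labels]
    # Only complain about the value when the label is present at all.
    if labels.get("kube-monkey/enabled", "enabled") != "enabled":
        problems.append("kube-monkey/enabled is not set to 'enabled'")
    return problems


# Hypothetical labels read from a deployment manifest.
labels = {
    "kube-monkey/enabled": "enabled",
    "kube-monkey/mtbf": "1",
    "kube-monkey/kill-mode": "fixed",
    "kube-monkey/kill-value": "1",
}
print(kube_monkey_label_problems(labels))
```

A check like this could run in CI so that a service never silently drops out of chaos testing.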

Step 2: Set Up IAM Role for AWS FIS

AWS FIS allows you to simulate real faults in AWS services. First, create a role (AWSFISExperimentRole) with permissions like:

{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["rds:RebootDBInstance"], "Resource": "*"},
    {"Effect": "Allow", "Action": ["elasticache:TestFailover"], "Resource": "*"},
    {"Effect": "Allow", "Action": ["kafka:RebootBroker"], "Resource": "*"},
    {"Effect": "Allow", "Action": ["acm:UpdateCertificateOptions"], "Resource": "*"},
    {"Effect": "Allow", "Action": ["kms:DisableKey"], "Resource": "*"},
    {"Effect": "Allow", "Action": ["route53:ChangeResourceRecordSets"], "Resource": "*"}
  ]
}
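If you prefer to generate the policy rather than hand-edit JSON, a short Python sketch along these lines may help. It assumes the six actions listed above and a wildcard resource (an assumption made for brevity; narrowing to specific ARNs is safer). The trust policy is included because FIS needs to be able to assume the role; the helper and variable names are hypothetical.

```python
import json

# Actions taken from the policy above.
FIS_ACTIONS = [
    "rds:RebootDBInstance",
    "elasticache:TestFailover",
    "kafka:RebootBroker",
    "acm:UpdateCertificateOptions",
    "kms:DisableKey",
    "route53:ChangeResourceRecordSets",
]

# Trust policy that lets the FIS service assume the role.
FIS_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "fis.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def build_policy(actions, resource="*"):
    """Build an IAM policy document with one Allow statement per action."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": [action], "Resource": resource}
            for action in actions
        ],
    }


policy_json = json.dumps(build_policy(FIS_ACTIONS), indent=2)
print(policy_json)
```

Passing a specific ARN as `resource` is one way to keep experiments away from resources you never intend to break.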

Step 3: Create FIS Experiment Templates

Use AWS FIS to simulate service disruptions across key infrastructure components.

Examples:

  • Reboot RDS instances to simulate DB crashes
  • Trigger ElastiCache failovers to test the caching layer stability
  • Restart MSK brokers for message queue resilience
  • Disable KMS keys or change ACM certificate options to test security and access controls
  • Delete Route 53 records to validate DNS failure fallback

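Each experiment is defined in a JSON template and registered with:

aws fis create-experiment-template --cli-input-json file://

Creating the template does not run it; launch the experiment afterwards with aws fis start-experiment.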
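To make the template format concrete, here is a hedged Python sketch that builds a template for the RDS reboot example. The action ID, resource type, and target key are written from memory of the FIS action reference and should be verified there; the account ID, database ARN, and function name are hypothetical placeholders.

```python
import json


def rds_reboot_template(role_arn, db_instance_arn):
    """Return an FIS experiment template that reboots one RDS instance."""
    return {
        "description": "Reboot a single RDS instance to simulate a database crash",
        "roleArn": role_arn,
        # No stop condition is wired up in this sketch; a CloudWatch alarm is safer.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "database": {
                "resourceType": "aws:rds:db",
                "resourceArns": [db_instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "reboot": {
                "actionId": "aws:rds:reboot-db-instances",
                # Maps the action's target key to the target named above.
                "targets": {"DBInstances": "database"},
            }
        },
    }


template = rds_reboot_template(
    "arn:aws:iam::111122223333:role/AWSFISExperimentRole",  # placeholder account ID
    "arn:aws:rds:us-east-1:111122223333:db:example-db",     # placeholder database
)
print(json.dumps(template, indent=2))
```

The printed JSON can be saved to a file and passed to the create-experiment-template command.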

Step 4: Monitor & Analyse Results

Monitoring Tools:

  • kubectl get pods --watch
  • CloudWatch Dashboards for latency, throughput, and alarm triggers
  • EFK stack or CloudWatch Logs for service-level diagnostics

Key Metrics:

  • MTTD – Mean Time to Detect
  • MTTR – Mean Time to Recover
  • Error rate and latency trends

These insights help you break and fix your AWS EKS microservices effectively, ensuring they recover gracefully from failure.
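As a rough illustration of how the two headline metrics can be derived from experiment records, the Python sketch below averages detection and recovery times. The timestamps are invented sample data, and measuring MTTR from the moment of fault injection (rather than from detection) is an assumption; teams define it both ways.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(pairs):
    """Average the gap, in minutes, between each (start, end) datetime pair."""
    return mean((end - start).total_seconds() / 60 for start, end in pairs)


# Hypothetical experiment log: (fault injected, alert fired, service healthy again).
runs = [
    (datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 2), datetime(2024, 1, 8, 10, 9)),
    (datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 10, 4), datetime(2024, 1, 15, 10, 15)),
]

mttd = mean_minutes([(injected, detected) for injected, detected, _ in runs])
mttr = mean_minutes([(injected, recovered) for injected, _, recovered in runs])
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```

Tracking these two numbers across repeated experiments shows whether alerting and recovery are actually improving.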

Step 5: Automate Your Chaos Workflow

Chaos shouldn’t be a one-time stunt. Automate it via:

Tool                      Use Case
CI/CD                     Inject chaos after deployment
EventBridge               Schedule weekly disruptions
CloudWatch Alarms + SNS   Notify teams via Slack or email

Chaos automation ensures your systems are continuously prepared, not just during testing cycles.
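For the weekly schedule, EventBridge rules accept a cron expression with six fields. The small Python helper below is a sketch of how such an expression could be generated; the function name and the chosen day and hour are hypothetical.

```python
DAYS = ("SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT")


def weekly_cron(day, hour_utc, minute=0):
    """Build an EventBridge cron expression that fires once a week (UTC)."""
    if day not in DAYS:
        raise ValueError(f"day must be one of {DAYS}")
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour or minute out of range")
    # Field order: minutes hours day-of-month month day-of-week year.
    # Day-of-month is "?" because the day-of-week field is set.
    return f"cron({minute} {hour_utc} ? * {day} *)"


print(weekly_cron("TUE", 10))
```

Scheduling during working hours, when engineers are available to respond, is a common choice for early experiments.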

Why You Should Break and Fix Your AWS EKS Microservices with Chaos Monkey & FIS

  • Reveal hidden single points of failure
  • Improve service durability and uptime
  • Build engineering confidence
  • Reduce incident response time
  • Strengthen observability and monitoring practices

Final Thought

To break and fix your AWS EKS microservices with Chaos Monkey & FIS is to truly prepare them for the real world.

Every system fails eventually—what matters is how quickly it recovers. With the right tools, strategy, and mindset, chaos becomes not a threat but a training ground for resilience.

Further Reading

In today’s cloud-native environment, your ability to break and fix your AWS EKS microservices with Chaos Monkey & FIS could be the difference between downtime and resilience.

Contact Cloud Technology Hub for a strategy consultation, or subscribe to our newsletter for more tips.

7 Kubernetes Security Best Practices: How to Securely Set Up and Harden Your Cluster

Kubernetes Security Best Practices: Hardening Your Cluster

Following Kubernetes security protocols is essential to ensuring your cluster remains secure against potential threats. By following these Kubernetes Security Best Practices, you can securely set up and harden your Kubernetes environment to protect your applications and data.

Kubernetes Security Best Practices for Safe Cluster Deployment

Applying security best practices starts at deployment time, and understanding the options helps you make informed decisions. When deploying a Kubernetes cluster, the method you choose affects both security and scalability. Here are the main options to consider:

  • Managed Kubernetes (EKS, GKE, AKS): Cloud providers handle the security of the control plane and provide automatic updates.
  • Kubeadm: A more hands-on approach, using kubeadm for bootstrapping clusters on your own virtual machines (VMs).
  • kOps / Kubespray: These infrastructure-as-code solutions are suitable for production-grade clusters.

Recommendation: If you’re new to Kubernetes, starting with kubeadm on hardened VMs can help you understand the core components of cluster security.

Securing the Kubernetes Host OS: Critical Security Steps

Before setting up Kubernetes, secure the host operating system, and keep these controls in place throughout your cluster’s lifecycle. Below are a few key actions:

  • Use a Minimal OS: Choose Ubuntu LTS or CentOS Stream to minimise unnecessary packages.
  • Enable Automatic Updates: Keep the system updated automatically to patch security vulnerabilities.

sudo apt-get install unattended-upgrades
sudo dpkg-reconfigure --priority=low unattended-upgrades

  • Disable Unnecessary Services: Turn off swap, which the kubelet expects to be disabled by default, and stop cloud-init and any other non-essential daemons once they are no longer needed.

  • Kernel Hardening: Update system parameters to protect your environment:

cat <<EOF | sudo tee /etc/sysctl.d/99-k8s.conf
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
fs.protected_regular = 1
fs.protected_fifos = 1
EOF
sudo sysctl --system

  • Install a Host Firewall: Use firewalld or UFW to limit access to the required ports only.

Hardening the Kubernetes Control Plane

Once the OS is secured, it’s important to harden the control plane itself. Follow these steps:

Addressing security through Kubernetes Security Measures strengthens your overall cloud security posture.

  • Audit Logging: Track all API activity for transparency and security:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
    resources:
      - group: ""
        resources: ["pods", "secrets", "configmaps"]
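Audit rules are evaluated in order, and the first matching rule sets the level for an event. The Python sketch below models that behaviour in a deliberately simplified form: it matches only on API group and resource name, ignores users, verbs, and namespaces, and assumes unmatched requests fall through to the level None. The function name is hypothetical.

```python
def audit_level(rules, group, resource, default="None"):
    """Return the level of the first rule matching the resource (first match wins)."""
    for rule in rules:
        selectors = rule.get("resources")
        if not selectors:
            return rule["level"]  # a rule without selectors matches everything
        for selector in selectors:
            names = selector.get("resources")
            if selector.get("group", "") == group and (not names or resource in names):
                return rule["level"]
    return default


# The policy above, expressed as Python data.
rules = [
    {
        "level": "Metadata",
        "resources": [{"group": "", "resources": ["pods", "secrets", "configmaps"]}],
    }
]

print(audit_level(rules, "", "secrets"))
print(audit_level(rules, "apps", "deployments"))
```

Because order matters, broad catch-all rules usually belong at the end of the list.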

  • Encryption at Rest: Enable encryption for sensitive data such as secrets:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret:
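The secret field expects a base64-encoded key. As a minimal sketch, assuming a 32-byte key is wanted for the aescbc provider, one could be generated in Python like this; the function name is hypothetical, and the result should go straight into a protected configuration file rather than a log or a repository.

```python
import base64
import secrets


def new_aescbc_key():
    """Return a random 32-byte key, base64-encoded for use as a provider secret."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")


key = new_aescbc_key()
print(len(base64.b64decode(key)), "random bytes generated")
```

The `secrets` module is used instead of `random` because it draws from the operating system’s cryptographic random source.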

Deploying a Network Plugin with Security Policies

Using a CNI that supports NetworkPolicies, such as Calico, helps enforce security at the network level. Here’s how to apply it:

kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

After deploying, set up a default-deny policy to limit traffic:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: default
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
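What makes a policy a default-deny is the combination of an empty pod selector, both policy types, and no allow rules. The Python sketch below checks a parsed manifest for exactly those properties; the function name is hypothetical and the check covers only this one pattern.

```python
def is_default_deny(policy):
    """True if the policy selects every pod and allows no ingress or egress traffic."""
    spec = policy.get("spec", {})
    selects_all_pods = spec.get("podSelector") == {}
    covers_both = {"Ingress", "Egress"} <= set(spec.get("policyTypes", []))
    has_no_allow_rules = not spec.get("ingress") and not spec.get("egress")
    return selects_all_pods and covers_both and has_no_allow_rules


# The manifest above, expressed as Python data.
default_deny = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "default-deny", "namespace": "default"},
    "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
}

print(is_default_deny(default_deny))
```

Keep in mind that denying all egress also blocks DNS lookups, so an explicit allow rule for DNS is usually the next policy to add.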

Configuring RBAC and Pod Security

To ensure a least-privilege model, configure Role-Based Access Control (RBAC) and enforce Pod Security Standards:

  • Disable Anonymous Access: Make sure the kube-apiserver is configured with --anonymous-auth=false.
  • Create Specific Roles: Define roles with minimum required permissions and bind them to service accounts:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list"]

Utilising Kubernetes Hardening Strategies such as this allows you to enforce consistent security policies.
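To reason about what a Role actually permits, it can be useful to model the rule matching. The Python sketch below is a simplified approximation (it handles the "*" wildcard but ignores resourceNames, subresources, and aggregated roles); the function name is hypothetical.

```python
def role_allows(role, api_group, resource, verb):
    """True if any rule in the Role grants the verb on the resource ("*" is a wildcard)."""

    def matches(allowed, wanted):
        return "*" in allowed or wanted in allowed

    return any(
        matches(rule.get("apiGroups", []), api_group)
        and matches(rule.get("resources", []), resource)
        and matches(rule.get("verbs", []), verb)
        for rule in role.get("rules", [])
    )


# The pod-reader Role above, expressed as Python data.
pod_reader = {
    "kind": "Role",
    "metadata": {"namespace": "production", "name": "pod-reader"},
    "rules": [{"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "watch", "list"]}],
}

print(role_allows(pod_reader, "", "pods", "list"))
print(role_allows(pod_reader, "", "pods", "delete"))
```

On a live cluster, kubectl auth can-i answers the same question against the real authoriser.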

Additionally, implement PodSecurityAdmission to enforce security policies at the namespace level:

kubectl label namespace production pod-security.kubernetes.io/enforce=restricted

Securing Container Images and Supply Chain Integrity

Utilise a private container registry and enable vulnerability scanning to secure your images.

Enforce image signing using Cosign:

cosign sign --key cosign.key myregistry/myapp:latest

Additionally, prevent the use of unscanned or “latest” tags by using an admission controller like OPA/Gatekeeper.
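The tag rule itself is simple to express. The following Python sketch shows the kind of check an admission policy would encode; it is not Gatekeeper Rego, the function name is hypothetical, and the image names are examples only.

```python
def image_tag_problem(image):
    """Return a reason to reject the image reference, or None if it is pinned."""
    if "@sha256:" in image:
        return None  # pinned by digest
    last_segment = image.rsplit("/", 1)[-1]  # ignore any registry host:port
    if ":" not in last_segment:
        return "no tag given, so the runtime would default to latest"
    tag = last_segment.rsplit(":", 1)[1]
    if tag == "latest":
        return "the latest tag is mutable"
    return None


for image in ("myregistry/myapp:latest", "myregistry/myapp:1.4.2", "localhost:5000/myapp"):
    print(image, "->", image_tag_problem(image) or "ok")
```

Splitting on the last path segment avoids mistaking a registry port such as localhost:5000 for a tag.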


Monitoring and Intrusion Detection

Use Prometheus and Grafana for monitoring your cluster’s health.

For real-time intrusion detection, deploy Falco:

helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco

Also, run Kube-bench and Kube-hunter for periodic security assessments.

Centralised Logging and Audit Collection

Centralising logs allows for more effective monitoring and incident response. Forward kube-apiserver, kubelet, and container logs to a central logging solution like ELK or a managed service.

Use Fluentd or Filebeat to ship audit logs for real-time analysis.

Incorporating Kubernetes Cluster Security Guidelines is essential for continuous compliance and risk management.

Regular Maintenance and Updates

To maintain the security of your cluster, it’s important to regularly update and maintain it:

  • Rotate Certificates and Tokens: Rotate them every 90 days to minimise the risk of compromise.
  • Apply Patches Promptly: Upgrade Kubernetes to the latest patch release within 30 days of publication.
  • Conduct Regular CIS Benchmarks: Assess your cluster quarterly using CIS Kubernetes Benchmarks to ensure compliance with best practices.
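The rotation interval is easy to turn into an automated reminder. Below is a small Python sketch, assuming you record the date each certificate or token was issued; the function name and dates are hypothetical.

```python
from datetime import date, timedelta

ROTATION_PERIOD = timedelta(days=90)  # interval taken from the list above


def rotation_due(issued_on, today):
    """True once a credential has reached or passed the rotation period."""
    return today - issued_on >= ROTATION_PERIOD


print(rotation_due(date(2024, 1, 1), date(2024, 3, 31)))
```

The same pattern works for the 30-day patch window by swapping in a different period.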

Conclusion

By following these Kubernetes security measures, you can build and maintain a secure cluster environment.

Kubernetes offers powerful tools for application delivery, but it is crucial to implement these best practices to safeguard your infrastructure from threats and vulnerabilities.

