
Chaos Engineering: Designing Systems That Survive Failure

March 25, 2024 · 12 min read · by Bishwambhar Sen
Figure: Dashboard showing steady-state system metrics, with a controlled failure injection spike and the subsequent recovery curve.

Concept

Netflix's Chaos Monkey, the tool that injected random instance failures into their production fleet, popularized the term "chaos engineering." But the Chaos Monkey was not about randomness — it was about making failure so common that the organization had to build systems that survived it. Random instance termination at Netflix meant teams had two choices: build resilient services or get paged at 3 AM every Tuesday. Most teams chose resilience.

Chaos engineering has since matured into a discipline with a formal definition (the Principles of Chaos Engineering, principlesofchaos.org), a scientific methodology (hypothesis → experiment → observation → learning), and a rich tooling ecosystem (Chaos Toolkit, LitmusChaos, AWS Fault Injection Service, formerly Fault Injection Simulator).

The core proposition is simple: production systems contain failure modes that staging environments cannot reproduce. Rare network partitions, gradual memory leaks under sustained load, cascading timeout chains, connection pool exhaustion under burst traffic — these emerge from the interaction of real traffic patterns, real infrastructure behavior, and real operational conditions. The only way to discover them before your users do is to induce them deliberately in production, under controlled conditions, with defined rollback criteria.

The Principles of Chaos Engineering

The foundational methodology (formalized by the Netflix chaos team):

  1. Define steady state. What does "normal" look like? Quantified by observable metrics: requests per second, error rate, p99 latency, queue depth.
  2. Hypothesize that steady state will continue. The experiment begins with a falsifiable prediction: "Injecting 500ms latency on calls to the Inventory Service will not increase the Order Service error rate above 0.5%, because the per-call timeout is configured at 1 second and the injected delay stays below it."
  3. Introduce real-world events. Terminate instances, inject latency, exhaust connection pools, simulate disk I/O saturation.
  4. Try to disprove the hypothesis. If the hypothesis holds, you've gained confidence in a specific resilience property. If it fails, you've found a vulnerability before your users did.

The discipline is not about finding as many failures as possible. It is about systematically confirming specific resilience properties and discovering where they break.
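As one illustration of step 3, latency can be injected in-process on outbound HTTP calls with a `DelegatingHandler`. This is a minimal sketch, not the mechanism any particular chaos tool uses. `LatencyInjectionHandler`, its constructor parameters, and the `Enable`/`Disable` switch are hypothetical names introduced here, and it assumes .NET 6 or later for `Random.Shared`.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler: delays a configurable fraction of outbound HTTP calls.
public sealed class LatencyInjectionHandler : DelegatingHandler
{
    private readonly TimeSpan _delay;
    private readonly double _affectedFraction; // 0.0 (no calls) to 1.0 (all calls)
    private volatile bool _enabled;            // injection is off until Enable() is called

    public LatencyInjectionHandler(TimeSpan delay, double affectedFraction)
    {
        if (affectedFraction is < 0.0 or > 1.0)
            throw new ArgumentOutOfRangeException(nameof(affectedFraction));
        _delay = delay;
        _affectedFraction = affectedFraction;
    }

    public void Enable() => _enabled = true;
    public void Disable() => _enabled = false; // acts as the rollback switch

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Delay only the sampled fraction of requests, then forward unchanged.
        if (_enabled && Random.Shared.NextDouble() < _affectedFraction)
            await Task.Delay(_delay, cancellationToken);

        return await base.SendAsync(request, cancellationToken);
    }
}
```

The handler would be placed in front of the real transport, for example `new HttpClient(new LatencyInjectionHandler(TimeSpan.FromMilliseconds(500), 0.01) { InnerHandler = new HttpClientHandler() })`. Keeping the affected fraction as a constructor argument makes the traffic scope of the experiment explicit at the point where the fault is wired in.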

GameDay Methodology

A GameDay is a structured, scheduled chaos experiment involving the full team — engineers, SREs, product managers, and sometimes stakeholders. It was formalized by Jesse Robbins at Amazon and adopted across the industry.

The GameDay structure:

Phase 1: Pre-experiment (1–2 weeks before)

  • Define the experiment hypothesis explicitly
  • Identify steady-state metrics and thresholds
  • Define blast radius: which services, which environments, which users may be affected
  • Define rollback criteria: what metric breach triggers immediate abort
  • Notify the on-call rotation and customer support teams
  • Verify rollback mechanisms are tested and ready

Phase 2: The experiment

  • Begin with the smallest possible blast radius (1 instance, 1% of traffic)
  • Expand only if steady state is maintained at each step
  • Real-time monitoring on a shared screen visible to all participants
  • One designated "abort button" holder with authority to stop immediately
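The "start small, expand only while steady state holds" rule in Phase 2 can be sketched as a loop over traffic stages. Everything here is an assumption for illustration: `StagedExpansion`, the delegate parameters, and the example stage values are hypothetical, and the delegates stand in for whatever routing and monitoring the team actually uses.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class StagedExpansion
{
    // Returns the largest traffic percentage at which steady state held,
    // or 0 if the first stage already breached it.
    public static async Task<int> RunAsync(
        IReadOnlyList<int> trafficPercentStages,        // for example 1, 5, 25
        Func<int, CancellationToken, Task> applyStage,  // hypothetical: expose N% of traffic to the fault
        Func<CancellationToken, Task<bool>> isSteady,   // hypothetical: steady-state check
        Func<Task> rollback,                            // hypothetical: remove the fault
        TimeSpan holdPerStage,
        CancellationToken abortToken = default)         // cancelled by the "abort button" holder
    {
        var lastSafePercent = 0;
        try
        {
            foreach (var percent in trafficPercentStages)
            {
                await applyStage(percent, abortToken);
                await Task.Delay(holdPerStage, abortToken);

                if (!await isSteady(abortToken))
                    break;                  // stop expanding at the first breach

                lastSafePercent = percent;
            }
        }
        finally
        {
            await rollback();               // the fault is removed on success, breach, or abort
        }
        return lastSafePercent;
    }
}
```

Cancelling `abortToken` surfaces as an `OperationCanceledException` to the caller, but only after the `finally` block has run the rollback.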

Phase 3: Post-experiment

  • Blameless post-mortem documenting observations, surprises, and metric deviations
  • Prioritized remediation backlog for identified vulnerabilities
  • Updated runbooks reflecting discovered failure modes

Constraints

Blast Radius Containment: The Primary Safety Mechanism

The blast radius is the maximum user impact of the experiment if it goes wrong. Blast radius containment is not just a safety measure — it is what makes chaos engineering acceptable to product, legal, and customer success teams.

Blast radius dimensions:

  • Service scope: Which services are targeted? Start with internal services (no direct user impact), then platform services, then user-facing services last.
  • Instance scope: How many replicas are targeted? Start with 1 of N.
  • Traffic scope: What percentage of traffic is affected? Start with 1%.
  • Time scope: How long does the experiment run? Start with 5–10 minutes.
  • Customer scope: Can you exclude premium/enterprise customers from the blast radius using feature flags or routing rules?

The blast radius must be defined before the experiment begins — not decided during it. Once an experiment is running, cognitive load is high and judgment is compromised by urgency.
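One way to force the blast radius to be decided up front is to make it a value that must be constructed and validated before an experiment can start. The sketch below is illustrative only. `BlastRadius`, `ServiceTier`, and the validation rules are hypothetical, and the rule that at least one replica must stay untouched is an assumption added here, not something the dimensions above prescribe.

```csharp
using System;
using System.Collections.Generic;

public enum ServiceTier { Internal, Platform, UserFacing }

public sealed record BlastRadius(
    ServiceTier Tier,                              // service scope
    int TargetedInstances,                         // instance scope
    int TotalInstances,
    double TrafficPercent,                         // traffic scope
    TimeSpan MaxDuration,                          // time scope
    IReadOnlyList<string> ExcludedCustomerTiers)   // customer scope
{
    // The most conservative starting point from the dimensions listed above:
    // an internal service, 1 of N replicas, 1% of traffic, 5 minutes.
    public static BlastRadius ConservativeStart(int totalInstances) =>
        new(ServiceTier.Internal, 1, totalInstances, 1.0,
            TimeSpan.FromMinutes(5), new[] { "premium", "enterprise" });

    // Returns a list of problems; an empty list means the definition is usable.
    public IReadOnlyList<string> Validate()
    {
        var problems = new List<string>();
        if (TargetedInstances < 1 || TargetedInstances >= TotalInstances)
            problems.Add("Target at least one replica, but leave at least one untouched.");
        if (TrafficPercent <= 0 || TrafficPercent > 100)
            problems.Add("Traffic percentage must be greater than 0 and at most 100.");
        if (MaxDuration <= TimeSpan.Zero)
            problems.Add("A positive maximum duration is required.");
        return problems;
    }
}
```

Because the record is immutable, widening the blast radius mid-experiment means creating a new value (`radius with { TrafficPercent = 5 }`), which is a deliberate, reviewable step and not an ad hoc tweak.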

Steady-State Definition: The Metric That Must Not Break

The steady state is the quantified normal behavior of the system. Without a precise steady state definition, you cannot determine whether an experiment has violated it.

A complete steady-state definition for a .NET order service might be:

  • Order placement success rate: > 99.5% (5-minute rolling window)
  • Order placement p99 latency: < 500ms
  • Active circuit breaker openings: 0 per 5 minutes
  • Thread pool queue depth: < 50 requests
  • Memory usage: < 2.5GB per pod
  • Dead letter queue depth: < 100 messages

Every metric has a threshold. The experiment is aborted automatically when any threshold is breached.
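The same thresholds can also be written as data, which keeps the definition reviewable in one place. This is a sketch under assumptions: the `Threshold` record, the metric key names, and the choice to treat a missing metric as a breach are all hypothetical and not taken from any monitoring product.

```csharp
using System.Collections.Generic;

public enum BoundKind { Min, Max }

public sealed record Threshold(string Metric, BoundKind Kind, double Limit)
{
    // Min: breached when the value drops below the limit.
    // Max: breached when the value rises above the limit.
    public bool IsBreachedBy(double value) =>
        Kind == BoundKind.Min ? value < Limit : value > Limit;
}

public static class OrderServiceThresholds
{
    // Metric keys are illustrative; the limits mirror the list above.
    public static readonly IReadOnlyList<Threshold> All = new[]
    {
        new Threshold("order_success_rate_pct", BoundKind.Min, 99.5),
        new Threshold("order_p99_latency_ms", BoundKind.Max, 500),
        new Threshold("circuit_breaker_openings", BoundKind.Max, 0),
        new Threshold("thread_pool_queue_depth", BoundKind.Max, 50),
        new Threshold("memory_gb_per_pod", BoundKind.Max, 2.5),
        new Threshold("dead_letter_queue_depth", BoundKind.Max, 100),
    };

    // Returns one message per breached threshold. A metric with no data
    // is reported as a breach, so the check fails closed.
    public static IReadOnlyList<string> Breaches(
        IReadOnlyDictionary<string, double> observed)
    {
        var breaches = new List<string>();
        foreach (var threshold in All)
        {
            if (!observed.TryGetValue(threshold.Metric, out var value))
                breaches.Add($"{threshold.Metric}: no data");
            else if (threshold.IsBreachedBy(value))
                breaches.Add($"{threshold.Metric}: {value} violates {threshold.Kind} {threshold.Limit}");
        }
        return breaches;
    }
}
```

Failing closed on missing data matters during chaos experiments in particular, because the injected fault may be the reason the metric stopped reporting.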

Hypothesis Formulation: The Scientific Constraint

A weak hypothesis: "The system will handle database failures." This cannot be falsified because it has no quantified threshold.

A strong hypothesis: "Terminating 1 of 3 PostgreSQL read replicas will not increase the p99 read latency above 120ms (currently 65ms), because the connection pool automatically redistributes reads across remaining replicas within 3 seconds."

The strong hypothesis is testable, quantified, and includes the mechanism by which the system is expected to survive. If the experiment disproves it, you've learned something specific.
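A hypothesis of this shape is easy to capture as a small value type, which makes it hard to write one without a number and a mechanism. The `Hypothesis` record and `OrderServiceHypotheses` class below are hypothetical names, and treating any single observation above the limit as a disproof is a simplifying assumption.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record Hypothesis(
    string Action,      // what is injected
    string Metric,      // what is observed
    double Baseline,    // current value of the metric
    double Limit,       // value the metric must not exceed
    string Unit,
    string Mechanism)   // why the system is expected to cope
{
    // Disproved as soon as any observation during the experiment exceeds the limit.
    public bool IsDisprovedBy(IEnumerable<double> observations) =>
        observations.Any(value => value > Limit);

    public string Describe() =>
        $"{Action} will not increase {Metric} above {Limit}{Unit} " +
        $"(currently {Baseline}{Unit}), because {Mechanism}.";
}

public static class OrderServiceHypotheses
{
    // The strong hypothesis from the text, expressed as data.
    public static readonly Hypothesis ReplicaLoss = new(
        Action: "Terminating 1 of 3 PostgreSQL read replicas",
        Metric: "p99 read latency",
        Baseline: 65,
        Limit: 120,
        Unit: "ms",
        Mechanism: "the connection pool redistributes reads across the remaining replicas within 3 seconds");
}
```

The weak hypothesis from the text cannot be expressed with this type at all, since it has no metric, baseline, or limit to supply.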

Trade-offs

Production vs. Staging Chaos

The purist position (Netflix's) is that chaos experiments must run in production to be meaningful, because staging environments do not reproduce production failure modes. The pragmatist position is that production chaos requires mature observability, tested rollback mechanisms, and organizational trust that take time to build.

The recommended progression:

  1. Staging environment chaos (no user impact, limited value, good for tool familiarity)
  2. Production chaos against non-critical services (internal services, batch jobs)
  3. Production chaos during low-traffic windows (2–6 AM local time)
  4. Production chaos during business hours (full confidence signal, maximum operational stress)

Most teams should spend 6–12 months at stages 1–2 before running stage 4.

Automated Chaos vs. Manual GameDays

Automated chaos (tools running experiments on a schedule without human oversight) provides continuous validation but requires very mature observability and automated rollback. A missed steady-state breach without a human abort mechanism can become a real incident.

Manual GameDays provide human judgment and organizational learning but run infrequently (quarterly at most). The realistic middle ground: automated chaos with automated steady-state monitoring and automatic abort, but a human on-call for every automated experiment window.

Mode                      | Learning Value | Safety  | Cadence    | Maturity Required
Manual GameDay            | High           | Highest | Quarterly  | Medium
Automated + Human on-call | High           | High    | Weekly     | High
Fully automated           | Medium         | Medium  | Continuous | Very High
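For the middle option, the scheduler needs a precondition check before each automated run. The sketch below shows one possible guard. `ExperimentWindow`, `AutomationGuard`, and the inputs are hypothetical; how on-call acknowledgement and incident status are obtained is left to the team's paging and incident tooling. It assumes .NET 6 or later for `TimeOnly`.

```csharp
using System;

public sealed record ExperimentWindow(TimeOnly Start, TimeOnly End)
{
    // Also handles windows that wrap past midnight (for example 23:00 to 03:00).
    public bool Contains(TimeOnly time) =>
        Start <= End
            ? time >= Start && time < End
            : time >= Start || time < End;
}

public static class AutomationGuard
{
    // An automated experiment may start only inside the agreed window,
    // with a human on call who has acknowledged it, and with no incident open.
    public static bool MayStart(
        TimeOnly localTime,
        ExperimentWindow window,
        bool onCallAcknowledged,
        bool incidentInProgress) =>
        window.Contains(localTime) && onCallAcknowledged && !incidentInProgress;
}
```

A guard like this only decides whether to begin. Aborting a run that is already in progress still depends on the automated steady-state monitoring described above.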

Code

The following shows a steady-state validator for a .NET Order Service — the core abstraction used to define and continuously evaluate whether the service is in a healthy state before, during, and after a chaos experiment:

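// SteadyStateDefinition.cs: formal steady-state contract for chaos experiments.
// IMetricsClient, its snapshot type, and SteadyStateEvaluation are application-defined
// types that are not shown in this listing.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
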
public class OrderServiceSteadyState
{
    private readonly IMetricsClient _metrics;
    private readonly ILogger<OrderServiceSteadyState> _logger;
    private readonly TimeSpan _evaluationWindow = TimeSpan.FromMinutes(5);

    public OrderServiceSteadyState(
        IMetricsClient metrics,
        ILogger<OrderServiceSteadyState> logger)
    {
        _metrics = metrics;
        _logger = logger;
    }

    public async Task<SteadyStateEvaluation> EvaluateAsync(
        CancellationToken cancellationToken = default)
    {
        var snapshot = await _metrics.GetSnapshotAsync(_evaluationWindow, cancellationToken);
        var violations = new List<string>();

        // Threshold 1: Success rate
        if (snapshot.OrderPlacementSuccessRate < 0.995)
            violations.Add($"Success rate {snapshot.OrderPlacementSuccessRate:P2} " +
                           $"< 99.5% threshold");

        // Threshold 2: p99 latency
        if (snapshot.OrderPlacementP99Ms > 500)
            violations.Add($"P99 latency {snapshot.OrderPlacementP99Ms:F0}ms > 500ms threshold");

        // Threshold 3: Circuit breaker state
        if (snapshot.OpenCircuitBreakerCount > 0)
            violations.Add($"{snapshot.OpenCircuitBreakerCount} circuit breaker(s) open");

        // Threshold 4: Thread pool pressure
        if (snapshot.ThreadPoolQueueDepth > 50)
            violations.Add($"Thread pool queue depth {snapshot.ThreadPoolQueueDepth} > 50");

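        // Threshold 5: Memory pressure
        if (snapshot.MemoryUsageGb > 2.5)
            violations.Add($"Memory {snapshot.MemoryUsageGb:F1}GB > 2.5GB threshold");

        // Threshold 6: Dead letter queue depth, as listed in the steady-state definition.
        // DeadLetterQueueDepth is an assumed property name on the snapshot type.
        if (snapshot.DeadLetterQueueDepth > 100)
            violations.Add($"Dead letter queue depth {snapshot.DeadLetterQueueDepth} > 100");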

        var isStable = violations.Count == 0;

        if (!isStable)
        {
            _logger.LogWarning(
                "Steady state VIOLATED: {Violations}",
                string.Join("; ", violations));
        }

        return new SteadyStateEvaluation
        {
            IsInSteadyState = isStable,
            Violations = violations,
            EvaluatedAt = DateTimeOffset.UtcNow,
            Snapshot = snapshot
        };
    }
}

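// ChaosExperimentRunner.cs: orchestrates a controlled chaos experiment.
// IChaosAgent, ChaosExperiment, and ExperimentResult are application-defined
// types that are not shown in this listing.
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
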
public class ChaosExperimentRunner
{
    private readonly OrderServiceSteadyState _steadyState;
    private readonly IChaosAgent _chaosAgent;
    private readonly ILogger<ChaosExperimentRunner> _logger;

    public ChaosExperimentRunner(
        OrderServiceSteadyState steadyState,
        IChaosAgent chaosAgent,
        ILogger<ChaosExperimentRunner> logger)
    {
        _steadyState = steadyState;
        _chaosAgent = chaosAgent;
        _logger = logger;
    }

    public async Task<ExperimentResult> RunAsync(
        ChaosExperiment experiment,
        CancellationToken cancellationToken = default)
    {
        _logger.LogInformation(
            "Starting chaos experiment: {Name} | Hypothesis: {Hypothesis}",
            experiment.Name, experiment.Hypothesis);

        // Validate pre-experiment steady state — abort if already unhealthy
        var baselineEval = await _steadyState.EvaluateAsync(cancellationToken);
        if (!baselineEval.IsInSteadyState)
        {
            _logger.LogWarning(
                "Experiment ABORTED: system not in steady state before injection. " +
                "Violations: {Violations}", string.Join(", ", baselineEval.Violations));
            return ExperimentResult.AbortedUnhealthyBaseline(baselineEval);
        }

        // Tracks the most recent evaluation so the result carries in-experiment evidence
        var lastEval = baselineEval;
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(experiment.MaxDuration);

        try
        {
            // Inject the failure condition. Injection happens inside the try block so
            // that a partially applied fault is still rolled back by the finally block.
            _logger.LogInformation("Injecting: {Action}", experiment.FailureAction);
            await _chaosAgent.InjectAsync(experiment.FailureAction, cancellationToken);

            // Monitor steady state until MaxDuration elapses or a breach is detected
            while (true)
            {
                await Task.Delay(TimeSpan.FromSeconds(30), cts.Token);
                lastEval = await _steadyState.EvaluateAsync(cts.Token);

                if (!lastEval.IsInSteadyState)
                {
                    _logger.LogError(
                        "Steady state BREACH detected. Aborting and rolling back. " +
                        "Hypothesis DISPROVED: {Hypothesis}",
                        experiment.Hypothesis);

                    return ExperimentResult.HypothesisDisproved(lastEval);
                }
            }
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // MaxDuration elapsed without a breach. Cancellation requested by the caller
            // is not caught here: it propagates once the finally block has rolled back,
            // so an externally aborted run is never reported as a confirmed hypothesis.
            _logger.LogInformation(
                "Experiment complete. Hypothesis CONFIRMED: {Hypothesis}",
                experiment.Hypothesis);
            return ExperimentResult.HypothesisConfirmed(lastEval);
        }
        finally
        {
            // Always remove the fault, including when monitoring throws.
            // CancellationToken.None keeps a cancelled experiment from cancelling its own rollback.
            await _chaosAgent.RollbackAsync(experiment.FailureAction, CancellationToken.None);
            _logger.LogInformation("Chaos injection rolled back.");
        }
    }
}
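The two listings above reference several types that are not shown. The following is a minimal sketch of what they might look like, inferred only from how the listings use them. Every member here is an assumption, including the choice to model the failure action as a plain string and the `ExperimentOutcome` enum.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shape of the metrics snapshot read by the steady-state validator.
public sealed record MetricsSnapshot(
    double OrderPlacementSuccessRate,   // 0.0 to 1.0
    double OrderPlacementP99Ms,
    int OpenCircuitBreakerCount,
    int ThreadPoolQueueDepth,
    double MemoryUsageGb,
    int DeadLetterQueueDepth);

public interface IMetricsClient
{
    Task<MetricsSnapshot> GetSnapshotAsync(TimeSpan window, CancellationToken cancellationToken);
}

public sealed class SteadyStateEvaluation
{
    public bool IsInSteadyState { get; init; }
    public IReadOnlyList<string> Violations { get; init; } = Array.Empty<string>();
    public DateTimeOffset EvaluatedAt { get; init; }
    public MetricsSnapshot Snapshot { get; init; } = default!;
}

// Hypothetical fault-injection abstraction; a real agent would wrap a chaos tool.
public interface IChaosAgent
{
    Task InjectAsync(string failureAction, CancellationToken cancellationToken);
    Task RollbackAsync(string failureAction, CancellationToken cancellationToken);
}

public sealed record ChaosExperiment(
    string Name,
    string Hypothesis,
    string FailureAction,
    TimeSpan MaxDuration);

public enum ExperimentOutcome { AbortedUnhealthyBaseline, HypothesisDisproved, HypothesisConfirmed }

public sealed record ExperimentResult(ExperimentOutcome Outcome, SteadyStateEvaluation Evidence)
{
    public static ExperimentResult AbortedUnhealthyBaseline(SteadyStateEvaluation evaluation) =>
        new(ExperimentOutcome.AbortedUnhealthyBaseline, evaluation);

    public static ExperimentResult HypothesisDisproved(SteadyStateEvaluation evaluation) =>
        new(ExperimentOutcome.HypothesisDisproved, evaluation);

    public static ExperimentResult HypothesisConfirmed(SteadyStateEvaluation evaluation) =>
        new(ExperimentOutcome.HypothesisConfirmed, evaluation);
}
```

Keeping `IChaosAgent` and `IMetricsClient` as interfaces also makes the runner testable with in-memory fakes before it is pointed at a real environment.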

The second example shows a resilience policy registry using Polly v8: the standard timeout, retry, and circuit breaker patterns that every .NET service should have in place as prerequisites before it participates in chaos experiments.

// ResiliencePolicies.cs: Polly v8 resilience pipeline for chaos-ready .NET services.
// These strategies are what chaos experiments validate actually work under failure.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;
using Polly.Timeout;

public static class ResiliencePipelineRegistry
{
    /// <summary>
    /// Standard resilience pipeline for outbound HTTP calls.
    /// Chaos experiments target services using this pipeline to validate
    /// that circuit breakers open within the expected threshold.
    /// Polly v8 runs strategies in the order they are added, outermost first,
    /// so the per-attempt timeout is added last to sit closest to the call.
    /// </summary>
    public static ResiliencePipeline<HttpResponseMessage> BuildHttpPipeline(
        string serviceName,
        ILogger logger)
    {
        return new ResiliencePipelineBuilder<HttpResponseMessage>()
            // Layer 1 (outermost): retry with exponential backoff for transient failures
            .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
            {
                MaxRetryAttempts = 3,
                Delay = TimeSpan.FromMilliseconds(100),
                BackoffType = DelayBackoffType.Exponential,
                UseJitter = true, // Prevents thundering herd on retry storms
                ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                    .HandleResult(r => (int)r.StatusCode >= 500)
                    .Handle<HttpRequestException>()
                    .Handle<TimeoutRejectedException>(), // thrown by the inner timeout
                OnRetry = args =>
                {
                    // AttemptNumber is zero-based, so add 1 for a human-readable count
                    logger.LogWarning(
                        "Retry attempt {Attempt} for {Service} after {Delay}ms",
                        args.AttemptNumber + 1, serviceName,
                        args.RetryDelay.TotalMilliseconds);
                    return ValueTask.CompletedTask;
                }
            })
            // Layer 2: circuit breaker, prevents cascading failure by failing fast
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
            {
                FailureRatio = 0.5,                           // Open at 50% failure rate
                MinimumThroughput = 10,                       // Need 10 calls to evaluate
                SamplingDuration = TimeSpan.FromSeconds(30),  // Over a 30-second window
                BreakDuration = TimeSpan.FromSeconds(15),     // Stay open for 15 seconds
                ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                    .HandleResult(r => (int)r.StatusCode >= 500)
                    .Handle<HttpRequestException>()
                    .Handle<TimeoutRejectedException>(), // timeouts count as failures too
                OnOpened = args =>
                {
                    logger.LogError(
                        "Circuit breaker OPENED for {Service}, failing fast for {Duration}s",
                        serviceName, args.BreakDuration.TotalSeconds);
                    return ValueTask.CompletedTask;
                },
                OnClosed = args =>
                {
                    logger.LogInformation(
                        "Circuit breaker CLOSED for {Service}, resuming calls",
                        serviceName);
                    return ValueTask.CompletedTask;
                }
            })
            // Layer 3 (innermost): per-attempt timeout, prevents indefinite blocking
            // on slow dependencies. Because it wraps each attempt, a timeout surfaces
            // as TimeoutRejectedException to the circuit breaker and retry above.
            .AddTimeout(new TimeoutStrategyOptions
            {
                Timeout = TimeSpan.FromMilliseconds(1000),
                OnTimeout = args =>
                {
                    logger.LogWarning(
                        "Timeout after {Timeout}ms calling {Service}",
                        args.Timeout.TotalMilliseconds, serviceName);
                    return ValueTask.CompletedTask;
                }
            })
            .Build();
    }
}
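A short usage sketch shows how a caller might execute a request through the pipeline built above and degrade gracefully when it gives up. `InventoryClient`, `TryGetStockAsync`, and the `stock/{sku}` route are hypothetical. The sketch assumes the Polly v8 package, the `ResiliencePipelineRegistry` class from the listing above, and an `HttpClient` whose `BaseAddress` is already configured.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Polly;
using Polly.CircuitBreaker;
using Polly.Timeout;

// Hypothetical client for the Inventory Service.
public sealed class InventoryClient
{
    private readonly HttpClient _http;
    private readonly ResiliencePipeline<HttpResponseMessage> _pipeline;

    public InventoryClient(HttpClient http, ILogger<InventoryClient> logger)
    {
        _http = http;
        _pipeline = ResiliencePipelineRegistry.BuildHttpPipeline("inventory-service", logger);
    }

    // Returns null when the dependency is unavailable so the caller can fall back.
    public async Task<string?> TryGetStockAsync(string sku, CancellationToken ct = default)
    {
        try
        {
            using var response = await _pipeline.ExecuteAsync(
                async token => await _http.GetAsync($"stock/{Uri.EscapeDataString(sku)}", token),
                ct);

            return response.IsSuccessStatusCode
                ? await response.Content.ReadAsStringAsync(ct)
                : null;
        }
        catch (BrokenCircuitException)
        {
            return null; // breaker is open: the call was rejected without reaching the service
        }
        catch (TimeoutRejectedException)
        {
            return null; // a timeout ended the call
        }
        catch (HttpRequestException)
        {
            return null; // connection-level failure after retries
        }
    }
}
```

The three `catch` blocks correspond to the failure paths a chaos experiment is expected to exercise, so each one is a place where the fallback behavior should be stated in the experiment hypothesis.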

Further Reading

External references: