Tags: circuit-breaker, dotnet, polly, resilience, distributed-systems

Implementing Circuit Breakers in .NET: Beyond the Basics

January 20, 2024·13 min read·by Bishwambhar Sen
[Figure: A state transition diagram showing the three circuit breaker states (Closed, Open, and Half-Open) with failure threshold and probe timer annotations]

Concept

The circuit breaker pattern, popularised by Michael Nygard in Release It!, is a stability pattern that prevents a service from repeatedly attempting an operation that is likely to fail. The name comes from electrical engineering: a physical circuit breaker trips when current exceeds a threshold, interrupting the circuit and preventing damage. In software, the circuit breaker trips when a dependency (a downstream service, a database, an external API) exceeds a failure threshold, and it stops calling that dependency for a configurable period.

The pattern's value is not in preventing individual failures — retry policies handle that. Its value is in cascade failure prevention. Without a circuit breaker, a slow or failing downstream service will exhaust your thread pool, increase response times, and propagate failures upstream to your callers. With a circuit breaker, calls to the failing dependency fail fast (immediately returning a cached response or an error), freeing threads and allowing your service to remain healthy while the downstream recovers.

The circuit breaker operates as a three-state machine:

Closed state: The circuit is operating normally. Calls pass through to the dependency. Failures are counted within a sliding window. When the failure rate exceeds the configured threshold (e.g., 50% failure rate over the last 60 seconds), the circuit trips to Open.

Open state: The circuit has tripped. All calls to the dependency fail immediately without attempting the actual call. A timer (the break duration) determines how long the circuit stays Open. This is not a "wait and retry" delay: no calls are queued or held while the circuit is Open. It is a deliberate fast-fail that protects the upstream service and gives the downstream time to recover.

Half-Open state: After the break duration expires, the circuit transitions to Half-Open. A limited number of probe requests are allowed through to test whether the dependency has recovered. If the probes succeed, the circuit closes. If they fail, the circuit returns to Open and resets the break timer (often with exponential backoff).

The Half-Open probe logic is where most production implementations have subtleties that matter.
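To make the three transitions concrete, here is a minimal sketch of the state machine. It is not how Polly implements it: SimpleCircuitBreaker and its members are hypothetical names, the class is not thread-safe, and it trips on a plain consecutive-failure count purely to keep the transitions easy to read.

```csharp
using System;

public enum BreakerState { Closed, Open, HalfOpen }

// Hypothetical, single-threaded sketch of the Closed / Open / Half-Open machine.
public sealed class SimpleCircuitBreaker
{
    private readonly int _failuresToTrip;
    private readonly TimeSpan _breakDuration;
    private int _consecutiveFailures;
    private DateTimeOffset _openedAt;

    public BreakerState State { get; private set; } = BreakerState.Closed;

    public SimpleCircuitBreaker(int failuresToTrip, TimeSpan breakDuration)
    {
        _failuresToTrip = failuresToTrip;
        _breakDuration = breakDuration;
    }

    // Returns true if the call may proceed; false means "fail fast".
    public bool AllowRequest(DateTimeOffset now)
    {
        if (State == BreakerState.Open && now - _openedAt >= _breakDuration)
        {
            State = BreakerState.HalfOpen; // break duration elapsed: admit one probe
            return true;
        }

        // Open (timer still running) and HalfOpen (probe already in flight) both reject.
        return State == BreakerState.Closed;
    }

    public void RecordSuccess()
    {
        if (State == BreakerState.Open) return; // late result from before the trip: ignore
        _consecutiveFailures = 0;
        State = BreakerState.Closed;            // a successful probe closes the circuit
    }

    public void RecordFailure(DateTimeOffset now)
    {
        if (State == BreakerState.Open) return; // late result from before the trip: ignore
        _consecutiveFailures++;
        if (State == BreakerState.HalfOpen || _consecutiveFailures >= _failuresToTrip)
        {
            State = BreakerState.Open;          // trip, or re-open after a failed probe
            _openedAt = now;                    // restart the break timer
        }
    }
}
```

A caller would check AllowRequest before each call and report the outcome with RecordSuccess or RecordFailure. Passing the clock in as a parameter keeps the sketch deterministic and easy to test.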

Constraints

The Sliding Window vs. Count-Based Threshold Problem

A naive circuit breaker trips when it observes N consecutive failures. This is problematic for high-throughput services. Consider a service making 10,000 calls per minute to a dependency. If the dependency degrades (say, 30% failure rate), you want the circuit to trip quickly — not after waiting for N consecutive failures, which might never happen if the errors are interspersed with successes.

Modern implementations use a sliding window approach with a dual threshold: trip when the failure rate reaches a percentage threshold AND the number of calls in the window reaches a minimum count. Polly v8's circuit breaker (in the Polly.Core package, on which Microsoft.Extensions.Resilience builds) exposes these as FailureRatio and MinimumThroughput. A FailureRatio of 0.5 and MinimumThroughput of 100 means: "trip only if at least 100 calls have been made in the last sampling window, and at least 50% of them failed."

Without MinimumThroughput, a single failed cold-start call could trip the circuit at system startup.
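The dual threshold is easy to get subtly wrong, so here is a small sketch of the decision. SlidingWindowFailureTracker is a hypothetical class for illustration only. It keeps every sample in a queue, which is simpler than the bucketed counters a real library would likely use, and it is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical time-based sliding window with a ratio + minimum-throughput check.
public sealed class SlidingWindowFailureTracker
{
    private readonly Queue<(DateTimeOffset At, bool Failed)> _samples = new();
    private readonly TimeSpan _window;
    private readonly double _failureRatio;
    private readonly int _minimumThroughput;
    private int _failures;

    public SlidingWindowFailureTracker(TimeSpan window, double failureRatio, int minimumThroughput)
    {
        _window = window;
        _failureRatio = failureRatio;
        _minimumThroughput = minimumThroughput;
    }

    public void Record(DateTimeOffset now, bool failed)
    {
        _samples.Enqueue((now, failed));
        if (failed) _failures++;
    }

    public bool ShouldTrip(DateTimeOffset now)
    {
        // Evict samples that have aged out of the window.
        while (_samples.Count > 0 && now - _samples.Peek().At > _window)
        {
            if (_samples.Dequeue().Failed) _failures--;
        }

        // Too few calls: the ratio is not meaningful yet, so never trip.
        if (_samples.Count < _minimumThroughput) return false;

        return (double)_failures / _samples.Count >= _failureRatio;
    }
}
```

With a 60-second window, a ratio of 0.5 and a minimum throughput of 100, a single failed call at startup leaves ShouldTrip false. It becomes true only once at least 100 calls are in the window and at least half of them failed.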

Half-Open Probe Concurrency and the Thundering Herd

When a circuit transitions from Open to Half-Open, it allows a limited number of probe requests through. The danger is concurrency: if 50 callers arrive while the circuit is Half-Open and only 1 probe is allowed, what happens to the other 49? If they queue behind the probe, they hold threads and connections while it is in flight, and may time out before it returns. If the circuit admits too many probes, a fragile recovering dependency can be overwhelmed again.

The canonical approach is to allow exactly 1 probe at a time in Half-Open, using a semaphore. All other calls during the probe window fail fast (same as Open behaviour). This is what Polly's implementation does. The probe must be a realistic representative call — not a health check endpoint. A dependency whose /health endpoint is fast may still time out on its actual query path. Probes must exercise the actual dependency path.
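The single-probe rule can be sketched with a one-slot semaphore that is acquired without blocking. HalfOpenProbeGate is a hypothetical name, and this is an illustration of the idea, not a copy of Polly's internals.

```csharp
using System;
using System.Threading;

// Hypothetical gate: at most one probe in flight, everyone else fails fast.
public sealed class HalfOpenProbeGate : IDisposable
{
    private readonly SemaphoreSlim _probeSlot = new(initialCount: 1, maxCount: 1);

    // Wait(0) never blocks: it returns true only for the caller that wins the slot.
    public bool TryBeginProbe() => _probeSlot.Wait(0);

    // Must be called exactly once by the caller that won the slot, success or failure.
    public void EndProbe() => _probeSlot.Release();

    public void Dispose() => _probeSlot.Dispose();
}
```

Callers that get false from TryBeginProbe are rejected immediately, exactly as if the circuit were still Open. Nothing queues behind the probe, so a slow probe cannot pile up waiting threads.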

Cascading Circuit Breaker Topology

In a service mesh of 15 microservices, every inter-service call should have a circuit breaker. When Service A calls Service B, which calls Service C, you have a layered breaker topology. If C's circuit trips, B's calls to C start failing fast. If those failures push B's own error rate, as seen by A, past the threshold of A's circuit breaker for B, that circuit may trip too. This cascade is intended: it surfaces the root cause (C's failure) through the topology. The operational challenge is ensuring that your observability layer captures which circuit breaker tripped, in which direction, and at what timestamp.
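As a sketch of what "which breaker, which direction, what timestamp" might look like as data, the record below captures one transition, and a helper picks the earliest Open transition as the likely root of a cascade. CircuitTransition and CascadeAnalysis are hypothetical names, and "earliest trip is the root" is a heuristic that assumes reasonably synchronised clocks across services.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical event shape: Caller -> Dependency gives the direction of the guarded call.
public sealed record CircuitTransition(
    string Caller,
    string Dependency,
    string NewState,
    DateTimeOffset At);

public static class CascadeAnalysis
{
    // Heuristic: in a cascade, the breaker closest to the failing service trips first.
    public static CircuitTransition? FindLikelyRoot(IEnumerable<CircuitTransition> transitions) =>
        transitions
            .Where(t => t.NewState == "Open")
            .OrderBy(t => t.At)
            .FirstOrDefault();
}
```

In the A, B, C example, B's breaker for C would open first and A's breaker for B a few seconds later, so FindLikelyRoot would return the B to C transition.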

Bulkhead Isolation and Circuit Breakers Are Complementary

A circuit breaker alone does not prevent thread pool exhaustion during the time before it trips. The bulkhead pattern, which limits the number of concurrent calls to a dependency, complements the circuit breaker by capping the blast radius. A bulkhead of 20 concurrent calls to Service C means at most 20 threads can be blocked waiting for C, even before the circuit breaker trips. After the circuit trips, new calls fail fast, so no further threads are tied up while the in-flight calls drain.
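A bulkhead can be sketched in a few lines with SemaphoreSlim. Bulkhead and BulkheadFullException are hypothetical names for this illustration. The sketch rejects immediately when all slots are taken and has no queue, which is one possible policy and not the only one.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class BulkheadFullException : Exception
{
    public BulkheadFullException() : base("Bulkhead is full; call rejected.") { }
}

// Hypothetical bulkhead: caps the number of concurrent calls to one dependency.
public sealed class Bulkhead
{
    private readonly SemaphoreSlim _slots;

    public Bulkhead(int maxConcurrentCalls) =>
        _slots = new SemaphoreSlim(maxConcurrentCalls, maxConcurrentCalls);

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> call)
    {
        // WaitAsync(0) does not wait: false means every slot is already in use.
        if (!await _slots.WaitAsync(0))
        {
            throw new BulkheadFullException();
        }

        try
        {
            return await call();
        }
        finally
        {
            _slots.Release(); // free the slot whether the call succeeded or threw
        }
    }
}
```

With new Bulkhead(20) in front of Service C, the 21st concurrent call is rejected at once instead of parking another thread on a slow dependency.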

Trade-offs

Breaking Too Aggressively vs. Not Aggressively Enough

A circuit breaker tuned too sensitively (low failure rate threshold, short sampling window) will trip on transient errors — a single GC pause causing a handful of timeouts is enough to take down the circuit for 30 seconds. This false positive degrades your service unnecessarily.

A circuit breaker tuned too conservatively (high failure rate threshold, long sampling window) will not trip until significant damage has been done — your thread pool is already degraded by the time the breaker opens.

The calibration approach: start with P95 latency data for your dependency under normal load. Set timeout at 2× P95. Set the sampling window to 60 seconds. Set the failure rate threshold to 50%, minimumThroughput to match your expected RPS × window duration × 0.1 (10% of normal traffic). Tune from there based on observed false-positive rates.
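The starting values above reduce to simple arithmetic. The sketch below encodes them. BreakerCalibration and BreakerCalibrator are hypothetical names, and the floor of 2 on the minimum throughput is an assumption added here so that a lone failure can never trip the breaker on a very quiet service.

```csharp
using System;

public sealed record BreakerCalibration(
    TimeSpan Timeout,
    TimeSpan SamplingWindow,
    double FailureRatio,
    int MinimumThroughput);

public static class BreakerCalibrator
{
    // Starting point only; tune from observed false-positive rates.
    public static BreakerCalibration FromBaseline(TimeSpan p95Latency, double expectedRps)
    {
        var window = TimeSpan.FromSeconds(60);

        // 10% of the calls expected in one window (assumed floor of 2).
        var minimumThroughput = Math.Max(
            2,
            (int)Math.Round(expectedRps * window.TotalSeconds * 0.1));

        return new BreakerCalibration(
            Timeout: p95Latency * 2,   // timeout at 2x P95
            SamplingWindow: window,
            FailureRatio: 0.5,
            MinimumThroughput: minimumThroughput);
    }
}
```

For a dependency with a P95 of 200 ms at 50 requests per second, this gives a 400 ms timeout and a minimum throughput of 50 × 60 × 0.1 = 300 calls per window.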

Polly v7 vs. v8 API and Microsoft.Extensions.Resilience

Polly v7 (the Polly NuGet package) uses Policy.Handle<>().CircuitBreaker(). Polly v8 uses pipeline builders with AddCircuitBreaker(); the Microsoft.Extensions.Resilience and Microsoft.Extensions.Http.Resilience packages build on it to integrate with the ASP.NET Core DI container and IHttpClientFactory. For greenfield .NET 8 projects, the v8 API with AddResiliencePipeline is the better choice: it provides built-in metrics via System.Diagnostics.Metrics, structured logging, and configuration binding.
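For a side-by-side feel of the two API styles, the sketch below builds roughly equivalent ratio-based breakers in each. It assumes the Polly 8.x NuGet package, which still ships the v7-style Policy API alongside the new builders. The threshold values are illustrative, not recommendations.

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;

// Polly v7 style: a policy object built from the static Policy entry point.
var v7Policy = Policy
    .Handle<HttpRequestException>()
    .AdvancedCircuitBreakerAsync(
        failureThreshold: 0.5,
        samplingDuration: TimeSpan.FromSeconds(60),
        minimumThroughput: 100,
        durationOfBreak: TimeSpan.FromSeconds(30));

// Polly v8 style: a strategy added to a pipeline builder via an options object.
ResiliencePipeline v8Pipeline = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,
        SamplingDuration = TimeSpan.FromSeconds(60),
        MinimumThroughput = 100,
        BreakDuration = TimeSpan.FromSeconds(30),
        ShouldHandle = new PredicateBuilder().Handle<HttpRequestException>()
    })
    .Build();
```

The options object is what makes the v8 style easier to bind to configuration and to compose with other strategies in a single pipeline.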

Code

The following demonstrates a production-grade circuit breaker configuration using Polly v8 with the Microsoft.Extensions.Http.Resilience package (which provides AddResilienceHandler), including custom outcome predicates and a half-open probe callback.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Microsoft.Extensions.Logging;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;
using Polly.Timeout;

public static class InventoryServiceResilienceExtensions
{
    public static IHttpClientBuilder AddInventoryServiceResilience(
        this IHttpClientBuilder builder,
        IConfiguration configuration)
    {
        // InventoryResilienceOptions is the application's own options class (not shown).
        var options = configuration
            .GetSection("Resilience:InventoryService")
            .Get<InventoryResilienceOptions>()
            ?? InventoryResilienceOptions.Default;

        // Strategies run in the order they are added: the first one added is the outermost.
        // The two-argument overload exposes the service provider used to resolve the logger.
        builder.AddResilienceHandler("inventory-pipeline", (pipelineBuilder, context) =>
        {
            var logger = context.ServiceProvider
                .GetRequiredService<ILogger<InventoryServiceCircuitBreakerMarker>>();

            // Layer 1: Bulkhead - cap concurrent calls to prevent thread pool exhaustion
            pipelineBuilder.AddConcurrencyLimiter(
                permitLimit: options.MaxConcurrentCalls,
                queueLimit: options.MaxQueuedCalls);

            // Layer 2: Circuit Breaker - trip on sustained failure rate
            pipelineBuilder.AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
            {
                // Trip once the failure rate over the sampling window reaches this ratio (e.g. 0.5)
                FailureRatio = options.FailureRateThreshold,
                // ...but only if at least this many calls were made in that window
                MinimumThroughput = options.MinimumThroughput,
                // Sampling window duration
                SamplingDuration = TimeSpan.FromSeconds(options.SamplingWindowSeconds),
                // How long to stay Open before testing recovery
                BreakDuration = TimeSpan.FromSeconds(options.BreakDurationSeconds),
                // Only count 5xx, timeouts and network errors - not 4xx (those are caller bugs)
                ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                    .Handle<HttpRequestException>()
                    .Handle<TimeoutRejectedException>()
                    .HandleResult(r => (int)r.StatusCode >= 500),
                OnOpened = args =>
                {
                    logger.LogWarning(
                        "CIRCUIT OPEN: InventoryService circuit breaker tripped. " +
                        "BreakDuration={BreakDuration}s. Last outcome: {Outcome}",
                        options.BreakDurationSeconds,
                        args.Outcome.Exception?.Message ?? args.Outcome.Result?.StatusCode.ToString());
                    return ValueTask.CompletedTask;
                },
                OnHalfOpened = args =>
                {
                    logger.LogInformation(
                        "CIRCUIT HALF-OPEN: Sending probe request to InventoryService");
                    return ValueTask.CompletedTask;
                },
                OnClosed = args =>
                {
                    logger.LogInformation(
                        "CIRCUIT CLOSED: InventoryService has recovered. Normal operation resumed.");
                    return ValueTask.CompletedTask;
                }
            });

            // Layer 3: Retry - limited retries INSIDE the circuit breaker.
            // Retries only run when the breaker lets the call through (Closed, or the Half-Open probe);
            // the breaker records one outcome per call, after retries are exhausted.
            pipelineBuilder.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
            {
                MaxRetryAttempts = 2,
                BackoffType = DelayBackoffType.Exponential,
                UseJitter = true,
                Delay = TimeSpan.FromMilliseconds(100),
                ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                    .Handle<HttpRequestException>()
                    .Handle<TimeoutRejectedException>()
            });

            // Layer 4: Timeout - defines what "slow" means for each attempt.
            // It must be innermost so that the retry and circuit breaker layers above
            // actually observe the TimeoutRejectedException it throws.
            pipelineBuilder.AddTimeout(TimeSpan.FromMilliseconds(options.TimeoutMs));
        });

        // AddResilienceHandler returns a pipeline builder, so return the client builder explicitly.
        return builder;
    }
}

// Marker class used only as the logger category for circuit breaker events
internal sealed class InventoryServiceCircuitBreakerMarker { }

The second pattern is a circuit breaker state monitor that exposes the current circuit state to a health check endpoint — essential for operational visibility.

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Microsoft.Extensions.Logging;
using Polly.CircuitBreaker;

public class CircuitBreakerHealthContributor : IHealthCheck
{
    // One CircuitBreakerStateProvider per tracked pipeline, keyed by pipeline name
    // (e.g. "inventory-pipeline", "payment-pipeline", "notification-pipeline").
    // Each provider must also be assigned to the StateProvider option of its circuit
    // breaker when the pipeline is built; how this dictionary is registered in DI is
    // left to the application.
    private readonly IReadOnlyDictionary<string, CircuitBreakerStateProvider> _stateProviders;
    private readonly ILogger<CircuitBreakerHealthContributor> _logger;

    public CircuitBreakerHealthContributor(
        IReadOnlyDictionary<string, CircuitBreakerStateProvider> stateProviders,
        ILogger<CircuitBreakerHealthContributor> logger)
    {
        _stateProviders = stateProviders;
        _logger = logger;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var circuitStates = new Dictionary<string, object>();
        var hasOpenCircuit = false;

        foreach (var (pipelineName, stateProvider) in _stateProviders)
        {
            // A provider that was never attached to a circuit breaker cannot report a state.
            if (!stateProvider.IsInitialized)
            {
                _logger.LogError("No circuit breaker is attached to the state provider for {Pipeline}", pipelineName);
                circuitStates[pipelineName] = "Unknown";
                continue;
            }

            var state = stateProvider.CircuitState;
            circuitStates[pipelineName] = state.ToString();

            // Open, HalfOpen and Isolated all mean the dependency is not fully available.
            if (state != CircuitState.Closed)
            {
                hasOpenCircuit = true;
                _logger.LogWarning("Circuit {Pipeline} is in state {State}", pipelineName, state);
            }
        }

        var result = hasOpenCircuit
            ? HealthCheckResult.Degraded(
                "One or more downstream circuit breakers are open. " +
                "Service is operating in degraded mode.",
                data: circuitStates)
            : HealthCheckResult.Healthy("All circuit breakers are closed.", data: circuitStates);

        // Reading the state is synchronous, so no async state machine is needed.
        return Task.FromResult(result);
    }
}

The health endpoint surfacing circuit state is the difference between a platform team that discovers a circuit has been open for 40 minutes via a customer complaint, and one that receives a PagerDuty alert within 30 seconds.

Further Reading