Zero-Downtime Deployments: Engineering Strategies That Actually Work

February 1, 2024·14 min read·by Bishwambhar Sen

Zero-downtime deployments are one of those phrases that sounds deceptively simple until you're the engineer coordinating a production rollout at 11 AM on a Tuesday with 200,000 active sessions. Then the phrase collapses into a series of hard sub-problems: How do you drain connections from the old version without dropping in-flight requests? How do you migrate a 400-million-row table without locking it? What happens to the user who submitted a form halfway through a canary rollout when the new version has a different payload schema?

This post works through the mechanics — not the marketing copy — of each major zero-downtime strategy, with attention to the failure modes that only surface when you're doing it at scale.

Concept

Blue-Green Deployments

Blue-green maintains two identical production environments, imaginatively named blue and green. At any given time, one is live and one is idle. A deployment promotes the idle environment by shifting the load balancer target. The switch is atomic from the router's perspective: one moment 100% of traffic goes to blue, the next it goes to green.

The elegance is also the trap. Because the switch is all-or-nothing, you lose the ability to incrementally validate the new version under real load. If the new version has a CPU regression that only manifests under peak production traffic, you will not catch it until you've already routed all production traffic to it.

The other trap is database state. Both environments typically share the same database. Any migration that ran during green's deployment window is now visible to blue if a rollback occurs. This means every schema change must be backward compatible with the previous application version for at least one deployment cycle — a constraint most teams discover by rolling back and watching the old code fail against the new schema.

Canary Releases

A canary routes a small percentage of traffic — typically 1% to 10% — to the new version while the remainder continues hitting the stable version. You instrument both cohorts with equivalent metrics (error rate, p99 latency, business events) and compare them statistically before widening the blast radius.

Canary releases shift the risk profile: rather than a binary all-or-nothing switch, you can observe behavior on real traffic with limited exposure. The cost is operational complexity. You now have two active versions in production simultaneously, which means your observability stack must be able to attribute metrics to a specific deployment version, not just to the service as a whole.
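
To make "compare them statistically" concrete, here is a minimal sketch of one possible check: a two-proportion z-test on the error rates of the two cohorts. The class name, method names, and the 2.33 threshold (roughly a 1% one-sided significance level) are illustrative assumptions, not part of any particular canary tool, and a real pipeline would likely compare latency and business metrics as well.

```csharp
using System;

// Hypothetical helper: compares error rates of the stable and canary cohorts.
public static class CanaryAnalysis
{
    // Two-proportion z-statistic; a positive value means the canary errors more often.
    public static double ErrorRateZScore(
        long stableRequests, long stableErrors,
        long canaryRequests, long canaryErrors)
    {
        if (stableRequests <= 0 || canaryRequests <= 0)
            throw new ArgumentException("Both cohorts need at least one request.");

        double stableRate = (double)stableErrors / stableRequests;
        double canaryRate = (double)canaryErrors / canaryRequests;

        // Pooled error rate under the hypothesis that both versions behave the same
        double pooled = (double)(stableErrors + canaryErrors)
                        / (stableRequests + canaryRequests);
        double standardError = Math.Sqrt(
            pooled * (1 - pooled) * (1.0 / stableRequests + 1.0 / canaryRequests));

        // Identical all-success or all-failure cohorts carry no evidence of a difference
        if (standardError == 0)
            return 0;

        return (canaryRate - stableRate) / standardError;
    }

    // One-sided decision: abort only when the canary is significantly worse.
    public static bool ShouldAbort(
        long stableRequests, long stableErrors,
        long canaryRequests, long canaryErrors,
        double zThreshold = 2.33)
        => ErrorRateZScore(stableRequests, stableErrors, canaryRequests, canaryErrors)
           > zThreshold;
}
```

With 99,000 stable requests at 99 errors and 1,000 canary requests at 10 errors, ShouldAbort returns true; with only 1 canary error in 1,000 the two rates match and it returns false. Small canary cohorts produce noisy rates, so a minimum sample size before evaluating is usually worth adding.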

Rolling Updates

Rolling updates replace instances gradually, cycling through the fleet one (or a few) at a time. In container orchestration environments like Kubernetes, this is the default strategy for Deployments: the Deployment controller creates new pods and terminates old ones incrementally, using readiness probes to confirm each new pod is healthy before advancing.

Rolling updates work well when your service is stateless or when session state is externalized to a store like Redis. They become painful when your code changes are not backward compatible: during a 10-minute rolling window, old and new instances serve traffic side by side, so consecutive requests from the same user can land on different versions.
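
One way to survive that version skew is a tolerant reader: new fields are optional and unknown fields are ignored, so either version can parse the other's payload. The sketch below assumes a hypothetical CheckoutPayload type in which CouponCode is the field introduced by the new version; it relies on System.Text.Json's default behavior of skipping properties it does not recognize.

```csharp
using System.Text.Json;

// Hypothetical payload shape. CouponCode stands in for a field added in the new version.
public sealed class CheckoutPayload
{
    public string CartId { get; init; } = "";

    // Optional, so payloads produced by the old version still deserialize
    public string? CouponCode { get; init; }
}

public static class PayloadReader
{
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNameCaseInsensitive = true
    };

    // Unknown properties are skipped by default, so a payload from a newer
    // version does not break an older reader that shares this type.
    public static CheckoutPayload Parse(string json)
        => JsonSerializer.Deserialize<CheckoutPayload>(json, Options)
           ?? throw new JsonException("Payload was null.");
}
```

A payload without couponCode parses with CouponCode left null, and a payload carrying an extra property the type does not declare parses without error. The same rule applies in reverse: the new version should not require the new field until no old instances remain.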

Feature Flags

Feature flags are orthogonal to deployment strategy but essential to the zero-downtime picture. They decouple code deployment from feature activation. You can ship a complete feature to production while it remains disabled for all users, then activate it for internal users, then beta users, then by percentage, all without a new deployment.

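The critical discipline is flag lifecycle management. Flags that are never cleaned up accumulate into technical debt: every stale flag is a dead branch that still has to be read, tested, and reasoned about, and the number of possible flag combinations grows with each one left behind.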
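
One lightweight way to enforce that discipline is to give every flag an owner and a removal deadline when it is created, then fail a scheduled check when a deadline passes. The sketch below is illustrative only: FlagDefinition, FlagAudit, and the RemoveBy field are hypothetical names, not features of any specific flag platform.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flag metadata recorded when the flag is introduced.
public sealed record FlagDefinition(string Name, string Owner, DateOnly RemoveBy);

public static class FlagAudit
{
    // Returns flags whose removal deadline has passed, oldest first.
    public static IReadOnlyList<FlagDefinition> FindExpired(
        IEnumerable<FlagDefinition> flags, DateOnly today)
        => flags.Where(flag => flag.RemoveBy < today)
                .OrderBy(flag => flag.RemoveBy)
                .ToList();
}
```

Running FindExpired in CI and failing the build on a non-empty result turns flag cleanup from a good intention into a tracked obligation with a named owner.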

Constraints

In-flight request draining: When an instance is being retired, whether in a rolling update or a blue-green switch, you must allow in-flight requests to complete before terminating the process. The typical mechanism is a SIGTERM handler that stops accepting new connections, waits for active requests to finish (up to a configured timeout), then exits cleanly. HTTP keep-alive connections complicate this because a single TCP connection can carry multiple requests from the same client, so an apparently idle connection may deliver a new request just as the server closes it.

Database migration ordering: The iron rule is to migrate the database before deploying the new application code that depends on it, and to ensure the old code can still function against the new schema. This produces a two-phase migration pattern. In phase one, you add the new column with a nullable type or default value, deploy the new code that reads from both the old and new columns, and begin backfilling. In phase two, once the old code version is completely gone, you drop the old column. Attempting to add a NOT NULL column without a default to a populated live table is the classic mistake: depending on the database engine and version, the statement either fails outright or forces a table rewrite under a lock that blocks writes for the duration.

Connection pool exhaustion during rollover: Blue-green environments can double your peak database connection count during the transition window, because both environments hold open pools even though only one is serving traffic. If your connection limit is tuned tightly to a single environment, the green environment's pool may fail to initialize.
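
The arithmetic is simple enough to check before the rollout instead of during it. The helper below is a rough sketch with hypothetical names and numbers: it computes the largest per-instance pool size that lets both environments hold their pools at once without exceeding the database limit.

```csharp
using System;

public static class ConnectionBudget
{
    // Largest pool size per instance such that every instance in every
    // simultaneously running environment fits under the database limit.
    public static int MaxPoolSizePerInstance(
        int dbMaxConnections,
        int reservedConnections,      // e.g. admin sessions, migrations, monitoring
        int instancesPerEnvironment,
        int liveEnvironments = 2)     // blue and green both hold pools during the switch
    {
        if (instancesPerEnvironment <= 0 || liveEnvironments <= 0)
            throw new ArgumentOutOfRangeException(
                nameof(instancesPerEnvironment), "Instance and environment counts must be positive.");

        int usable = dbMaxConnections - reservedConnections;
        int totalInstances = instancesPerEnvironment * liveEnvironments;
        return Math.Max(0, usable / totalInstances);
    }
}
```

For example, with a limit of 500 connections, 20 reserved, and 8 instances per environment, the budget is 30 connections per instance while both environments are up, versus 60 if only one environment ever ran. A pool sized at 60 would therefore exhaust the limit partway through green's startup.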

Stateful sessions: If your application stores session state in-process (in-memory), a rolling update will invalidate sessions on the instances being replaced. Every session-bearing user hitting a new instance will be forcibly logged out. The remediation is to externalize session state before undertaking any rolling strategy.
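
As a minimal configuration sketch of that remediation in ASP.NET Core, session data can be backed by a distributed cache instead of process memory. This assumes the Microsoft.Extensions.Caching.StackExchangeRedis package is installed and a connection string named "Redis" exists in configuration; the key prefix, timeout, and /visits endpoint are illustrative choices.

```csharp
// Program.cs (sketch): session state stored in Redis rather than in-process
var builder = WebApplication.CreateBuilder(args);

// Registers a Redis-backed IDistributedCache, which the session middleware uses as its store
builder.Services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = builder.Configuration.GetConnectionString("Redis"); // assumed name
    options.InstanceName = "sessions:";                                         // assumed key prefix
});

builder.Services.AddSession(options =>
{
    options.IdleTimeout = TimeSpan.FromMinutes(20);
    options.Cookie.HttpOnly = true;
    options.Cookie.IsEssential = true;
});

var app = builder.Build();

app.UseSession();

// Any instance can serve this request because the counter lives in Redis
app.MapGet("/visits", (HttpContext context) =>
{
    int visits = (context.Session.GetInt32("visits") ?? 0) + 1;
    context.Session.SetInt32("visits", visits);
    return Results.Ok(new { visits });
});

app.Run();
```

The session cookie is protected with ASP.NET Core Data Protection, so the Data Protection key ring also needs to be shared across instances; otherwise a replacement instance cannot read cookies issued by its predecessor even though the session data is still in Redis.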

Trade-offs

Strategy	Rollback Speed	Risk Exposure	Resource Cost	Schema Complexity
Blue-Green	Seconds (router flip)	All-at-once	2× environment	Shared DB: must be backward compat
Canary	Minutes to hours	Controlled %	Marginal (% of fleet)	Both versions live: must overlap
Rolling	Minutes	Incremental	No extra capacity	Both versions live: must overlap
Feature Flags	Milliseconds	Zero (deploy then gate)	None	Decoupled from schema

Blue-green is fastest to roll back but most expensive in infrastructure and puts the entire blast radius on the router decision. Canary is most sophisticated in risk management but requires investment in comparative observability. Rolling is the easiest to automate but offers the weakest story for schema coordination.

Feature flags are the only strategy that truly separates deployment from release — but they require discipline to avoid proliferating unmanaged flags across the codebase.

Code

Graceful Shutdown with In-Flight Request Draining

The following demonstrates a graceful shutdown sequence in ASP.NET Core. When the orchestrator sends SIGTERM, the host signals IHostApplicationLifetime.ApplicationStopping, the server stops accepting new connections, and existing request handlers run to completion, up to the ShutdownTimeout limit.

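// Program.cs
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services.Configure<HostOptions>(options =>
{
    // Give in-flight requests up to 30 seconds to complete after SIGTERM
    options.ShutdownTimeout = TimeSpan.FromSeconds(30);
});

// Shared state read by ReadinessProbe; a minimal sketch of the class is at the end of this file
builder.Services.AddSingleton<DeploymentStateService>();

builder.Services.AddHealthChecks()
    .AddCheck<ReadinessProbe>("readiness");

var app = builder.Build();

var deploymentState = app.Services.GetRequiredService<DeploymentStateService>();

// Report ready once the server is listening (real warm-up work, if any, would go here)
app.Lifetime.ApplicationStarted.Register(() => deploymentState.IsWarmedUp = true);

// SIGTERM triggers ApplicationStopping: fail the readiness probe so the
// load balancer stops routing here while in-flight requests finish.
// The listener closes soon after, so a short pre-stop delay is often paired with this.
app.Lifetime.ApplicationStopping.Register(() => deploymentState.IsDraining = true);

// Liveness: is the process alive?
app.MapGet("/health/live", () => Results.Ok(new { status = "alive" }));

// Readiness: is this instance ready to receive traffic?
// The rolling deployer advances only when this returns 200
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Name == "readiness",
    ResultStatusCodes =
    {
        [HealthStatus.Healthy]   = StatusCodes.Status200OK,
        [HealthStatus.Degraded]  = StatusCodes.Status503ServiceUnavailable,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
    }
});

app.Run();

// Minimal sketch of the state holder injected into ReadinessProbe (shape assumed)
public sealed class DeploymentStateService
{
    private volatile bool _isDraining;
    private volatile bool _isWarmedUp;

    public bool IsDraining { get => _isDraining; set => _isDraining = value; }
    public bool IsWarmedUp { get => _isWarmedUp; set => _isWarmedUp = value; }
}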

// ReadinessProbe.cs: controls whether the instance is routable
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class ReadinessProbe : IHealthCheck
{
    private readonly DeploymentStateService _deploymentState;

    public ReadinessProbe(DeploymentStateService deploymentState)
        => _deploymentState = deploymentState;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // During warm-up or drain: signal NOT ready so the load balancer
        // removes this instance from the rotation before we stop accepting requests
        if (_deploymentState.IsDraining)
        {
            return Task.FromResult(
                HealthCheckResult.Unhealthy("Instance is draining. Removing from load balancer pool."));
        }

        if (!_deploymentState.IsWarmedUp)
        {
            return Task.FromResult(
                HealthCheckResult.Degraded("Instance warming up. Holding off on traffic."));
        }

        return Task.FromResult(HealthCheckResult.Healthy("Ready to serve traffic."));
    }
}

Feature Flag–Gated Release with Percentage Rollout

This pattern shows a minimal feature flag evaluator that uses a deterministic hash of the user ID to ensure the same user consistently lands in or out of the experiment — preventing the jarring experience of a feature toggling on and off between requests.

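// FeatureFlagService.cs
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;
using Microsoft.Extensions.Configuration;

public sealed class FeatureFlagService
{
    private readonly IConfiguration _config;

    public FeatureFlagService(IConfiguration config)
        => _config = config;

    /// <summary>
    /// Returns true if the feature is enabled for the given user.
    /// Uses a stable hash so the same user always gets the same experience
    /// within the same percentage window.
    /// </summary>
    public bool IsEnabled(string featureName, string userId)
    {
        var flagConfig = _config.GetSection($"FeatureFlags:{featureName}");

        if (!flagConfig.Exists())
            return false;

        // "Enabled" acts as a global override: the feature is on for everyone
        bool globalOverride = flagConfig.GetValue<bool>("Enabled");
        if (globalOverride)
            return true;

        int rolloutPercentage = flagConfig.GetValue<int>("RolloutPercentage"); // 0–100
        if (rolloutPercentage <= 0)
            return false;

        // Deterministic: same user always maps to same bucket on every instance.
        // string.GetHashCode and HashCode.Combine are seeded per process in .NET,
        // so they would bucket the same user differently across instances and restarts.
        // (Math.Abs on a raw hash can also throw when the value is int.MinValue.)
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes($"{featureName}:{userId}"));
        int userBucket = (int)(BinaryPrimitives.ReadUInt32BigEndian(digest) % 100);
        return userBucket < rolloutPercentage;
    }
}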

// CheckoutController.cs: usage in an API controller, where old and new checkout flows coexist during rollout
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/checkout")]
public class CheckoutController : ControllerBase
{
    private readonly FeatureFlagService _flags;
    private readonly LegacyCheckoutService _legacyCheckout;
    private readonly EnhancedCheckoutService _enhancedCheckout;

    public CheckoutController(
        FeatureFlagService flags,
        LegacyCheckoutService legacyCheckout,
        EnhancedCheckoutService enhancedCheckout)
    {
        _flags = flags;
        _legacyCheckout = legacyCheckout;
        _enhancedCheckout = enhancedCheckout;
    }

    [HttpPost]
    public async Task<IActionResult> InitiateCheckout(
        [FromBody] CheckoutRequest request,
        [FromHeader(Name = "X-User-Id")] string userId)
    {
        if (_flags.IsEnabled("enhanced-checkout-v2", userId))
        {
            var result = await _enhancedCheckout.ProcessAsync(request);
            return Ok(result);
        }

        var legacyResult = await _legacyCheckout.ProcessAsync(request);
        return Ok(legacyResult);
    }
}

Database Migration Sequencing (Expand/Contract Pattern)

The expand/contract pattern executes schema changes in two discrete deployment cycles, never leaving the database in a state incompatible with either the current or previous application version.

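// Migration 001: EXPAND. Add the new column alongside the old one.
// Both old and new app versions can now run against this schema.
// Assumes Dapper's ExecuteAsync extension and MySQL syntax (UPDATE ... LIMIT).
using System.Data;
using Dapper;

public class AddUserTierColumnMigration : IDbMigration
{
    public string Version => "001-add-user-tier";

    public async Task UpAsync(IDbConnection connection)
    {
        // Nullable with no default: existing rows stay NULL, so the backfill below
        // can find them. A DEFAULT here would populate every existing row and turn
        // the WHERE tier IS NULL backfill into a no-op.
        await connection.ExecuteAsync(@"
            ALTER TABLE Users
            ADD COLUMN tier VARCHAR(20) NULL;
        ");

        // Backfill in batches to avoid long-running transactions.
        // ExecuteAsync returns the number of affected rows; stop when a batch updates none.
        int rowsUpdated;
        do
        {
            rowsUpdated = await connection.ExecuteAsync(@"
                UPDATE Users
                SET tier = CASE
                    WHEN subscription_level >= 3 THEN 'premium'
                    WHEN subscription_level >= 1 THEN 'standard'
                    ELSE 'free'
                END
                WHERE tier IS NULL
                LIMIT 5000;
            ");
        } while (rowsUpdated > 0);
    }
}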

// Migration 002: CONTRACT — drop old column once no app version references it
// Only run this after confirming zero instances of the old app are alive
public class DropSubscriptionLevelColumnMigration : IDbMigration
{
    public string Version => "002-drop-subscription-level";

    public async Task UpAsync(IDbConnection connection)
    {
        await connection.ExecuteAsync(@"
            ALTER TABLE Users DROP COLUMN subscription_level;
        ");
    }
}