Blog/Managing Distributed Transactions with the Saga Pattern
saga-patterndistributed-transactionsmicroservicesconsistency

Managing Distributed Transactions with the Saga Pattern

January 15, 2024·14 min read·by Bishwambhar Sen
A distributed order processing flow showing choreography-based saga steps with compensation paths highlighted in red

Concept

The two-phase commit protocol (2PC) achieves atomicity across distributed participants by coordinating a prepare phase and a commit phase through a central coordinator. When a participant fails during the commit phase, 2PC holds locks until recovery — a blocker that can last seconds to minutes in real systems. In a microservices architecture where services are independently deployable and operated, this lock-holding behaviour is unacceptable. A slow or crashed inventory service should not block the payment service, the order service, and every waiting thread in the coordinator.

The Saga pattern, originally described by Hector Garcia-Molina and Kenneth Salem in 1987 for long-lived database transactions, solves this by decomposing a distributed transaction into a sequence of local transactions, each committed independently within its service boundary. Each local transaction publishes an event or sends a message that triggers the next step. If any step fails, a sequence of compensating transactions is executed in reverse order to undo the previously committed steps.

Critically, Saga does not provide ACID atomicity. It provides ACD without I — atomicity (via compensation), consistency (eventual), and durability, but not isolation. Two concurrent sagas can observe each other's intermediate state. This is the central constraint that architects must design for explicitly.

There are two structural variants:

Choreography-based Saga: Each service listens to domain events and decides autonomously what to do next. No central coordinator exists. The saga's progress is implicit in the event stream. This variant scales well but makes the overall saga flow difficult to observe and reason about. Debugging a failed saga requires correlating events across multiple service logs.

Orchestration-based Saga: A central orchestrator (a dedicated service or a durable workflow engine) explicitly instructs each participant service what to do and tracks the saga's state machine. The flow is explicit and observable. The cost is a coordination bottleneck and a single logical component that must handle all compensation logic.

Constraints

The Isolation Problem: Dirty Reads Between Sagas

Consider an e-commerce order saga: Reserve Inventory → Charge Payment → Confirm Order → Send Notification. At the "Charge Payment" step, the inventory has been reserved (committed locally in the inventory service). A concurrent read of inventory by another customer will see reduced stock. If the payment charge then fails and the saga compensates by releasing the inventory reservation, the other customer's read was a dirty read — it observed a transient state that was ultimately rolled back.

The standard mitigation is the countermeasures pattern described by Chris Richardson:

  • Semantic Lock: Mark resources being modified by a saga with a "pending" or "dirty" flag. Other operations that encounter this flag either wait (polling), fail fast, or read only committed state. The flag is cleared on saga completion or compensation.
  • Commutative Updates: Design operations that can be applied in any order without producing inconsistent state. Account debits and credits are commutative if order does not matter.
  • Pessimistic View: Assume a saga step might be compensated and design downstream consumers to tolerate this.
  • Reread Value: Before committing a saga step, re-read the data that the step depends on to detect whether another saga has modified it since the saga began.

None of these countermeasures are free. Semantic locks add read complexity. Reread value adds latency and a potential conflict failure mode. The correct approach is domain-specific.

Idempotency Requirements for Compensation

Compensating transactions must be idempotent. Because message delivery in distributed systems is at-least-once, a compensation message may be delivered and processed multiple times. A compensating transaction that releases an inventory reservation must be safe to execute twice — the second execution should be a no-op, not a double-release.

Achieving idempotency requires compensation operations to be keyed on a saga correlation ID (or transaction ID). The compensation handler checks whether the compensation has already been applied for this saga ID before executing. This requires a deduplication store — typically a table in the local database with a unique constraint on (saga_id, step_name, operation_type).

Saga Failure Recovery and Backward vs. Forward Recovery

When a saga step fails, you have two strategies:

  • Backward recovery (compensation): Undo all previously committed steps. This is the default understanding of saga. It is appropriate when the business domain supports undo — a held inventory reservation can be released, a pending payment can be voided.
  • Forward recovery (retry): Persist the saga state and retry the failed step until it succeeds. This is appropriate when compensation is impossible — once an email notification is sent, you cannot "unsend" it. The notification step must be retried until successful, then marked complete.

Most real sagas require a mix of both. Non-retryable steps use backward recovery; non-compensatable steps use forward recovery with exponential backoff.

Trade-offs

Choreography vs. Orchestration: The Observability vs. Autonomy Axis

Choreography maximises service autonomy. Each service only knows its own inputs and outputs. The saga's flow is implicit in the event stream topology. This is powerful for small sagas (2–3 steps) where the flow is unlikely to change. It becomes a liability at scale: a 7-step saga spanning 7 services, where each step has a compensation path, produces a web of event subscriptions and implicit state transitions that is extremely difficult to introspect or debug. "Why did this saga stall?" requires correlating events across 7 service log streams.

Orchestration makes the saga flow explicit in code and in runtime state. The orchestrator holds the saga's state machine, knows which step failed, and drives recovery. Tools like MassTransit's StateMachine, Temporal, or NServiceBus Sagas persist the orchestration state durably. The cost is a new service to operate and a coordination bottleneck — if the orchestrator is unavailable, sagas cannot progress. This is generally acceptable because orchestrators are designed to be highly available stateless executors that persist state in a durable store (the database or a workflow engine).

For sagas with more than 3–4 steps, compensation logic, or frequent debugging requirements: use orchestration. For simple event chains where services genuinely should not know about each other: choreography is appropriate.

Code

The following C# example demonstrates an orchestration-based saga using a simplified durable state machine. This represents the core of what frameworks like MassTransit or NServiceBus provide, made explicit for educational clarity.

public class OrderSagaOrchestrator
{
    private readonly IInventoryServiceClient _inventoryClient;
    private readonly IPaymentServiceClient _paymentClient;
    private readonly IOrderServiceClient _orderClient;
    private readonly ISagaStateRepository _sagaStateRepo;
    private readonly ILogger<OrderSagaOrchestrator> _logger;

    public OrderSagaOrchestrator(
        IInventoryServiceClient inventoryClient,
        IPaymentServiceClient paymentClient,
        IOrderServiceClient orderClient,
        ISagaStateRepository sagaStateRepo,
        ILogger<OrderSagaOrchestrator> logger)
    {
        _inventoryClient = inventoryClient;
        _paymentClient = paymentClient;
        _orderClient = orderClient;
        _sagaStateRepo = sagaStateRepo;
        _logger = logger;
    }

    public async Task ExecuteAsync(CreateOrderCommand command, CancellationToken ct)
    {
        var sagaState = new OrderSagaState
        {
            SagaId = Guid.NewGuid(),
            OrderId = command.OrderId,
            CustomerId = command.CustomerId,
            LineItems = command.LineItems,
            CurrentStep = SagaStep.Started,
            CreatedAt = DateTimeOffset.UtcNow
        };

        await _sagaStateRepo.SaveAsync(sagaState, ct);

        try
        {
            // Step 1: Reserve inventory
            sagaState.CurrentStep = SagaStep.ReservingInventory;
            await _sagaStateRepo.SaveAsync(sagaState, ct);

            var reservationId = await _inventoryClient.ReserveAsync(
                new InventoryReservationRequest(sagaState.SagaId, command.LineItems), ct);
            sagaState.InventoryReservationId = reservationId;

            // Step 2: Charge payment
            sagaState.CurrentStep = SagaStep.ChargingPayment;
            await _sagaStateRepo.SaveAsync(sagaState, ct);

            var chargeId = await _paymentClient.ChargeAsync(
                new PaymentChargeRequest(sagaState.SagaId, command.CustomerId, command.TotalAmount), ct);
            sagaState.PaymentChargeId = chargeId;

            // Step 3: Confirm order
            sagaState.CurrentStep = SagaStep.ConfirmingOrder;
            await _sagaStateRepo.SaveAsync(sagaState, ct);

            await _orderClient.ConfirmAsync(
                new OrderConfirmationRequest(sagaState.SagaId, command.OrderId), ct);

            sagaState.CurrentStep = SagaStep.Completed;
            await _sagaStateRepo.SaveAsync(sagaState, ct);

            _logger.LogInformation("Order saga {SagaId} completed successfully", sagaState.SagaId);
        }
        catch (Exception ex) when (sagaState.CurrentStep >= SagaStep.ChargingPayment)
        {
            _logger.LogError(ex, "Order saga {SagaId} failed at step {Step}. Initiating compensation.",
                sagaState.SagaId, sagaState.CurrentStep);

            await CompensateAsync(sagaState, ct);
        }
    }

    private async Task CompensateAsync(OrderSagaState sagaState, CancellationToken ct)
    {
        sagaState.CurrentStep = SagaStep.Compensating;
        await _sagaStateRepo.SaveAsync(sagaState, ct);

        // Compensate in reverse order of execution
        if (sagaState.PaymentChargeId.HasValue)
        {
            try
            {
                await _paymentClient.VoidChargeAsync(
                    new VoidChargeRequest(sagaState.SagaId, sagaState.PaymentChargeId.Value), ct);
                _logger.LogInformation("Saga {SagaId}: payment charge voided", sagaState.SagaId);
            }
            catch (Exception ex)
            {
                // Compensation failures require human intervention — log and alert
                _logger.LogCritical(ex,
                    "COMPENSATION FAILURE: Could not void payment charge {ChargeId} for saga {SagaId}. " +
                    "Manual intervention required.",
                    sagaState.PaymentChargeId, sagaState.SagaId);
                sagaState.RequiresManualIntervention = true;
            }
        }

        if (sagaState.InventoryReservationId.HasValue)
        {
            try
            {
                await _inventoryClient.ReleaseReservationAsync(
                    new ReleaseReservationRequest(sagaState.SagaId, sagaState.InventoryReservationId.Value), ct);
                _logger.LogInformation("Saga {SagaId}: inventory reservation released", sagaState.SagaId);
            }
            catch (Exception ex)
            {
                _logger.LogCritical(ex,
                    "COMPENSATION FAILURE: Could not release inventory reservation {ReservationId} for saga {SagaId}.",
                    sagaState.InventoryReservationId, sagaState.SagaId);
                sagaState.RequiresManualIntervention = true;
            }
        }

        sagaState.CurrentStep = sagaState.RequiresManualIntervention
            ? SagaStep.FailedRequiresIntervention
            : SagaStep.Compensated;

        await _sagaStateRepo.SaveAsync(sagaState, ct);
    }
}

The idempotent compensation handler below shows how downstream services protect themselves from duplicate compensation messages. This is the pattern each participant service must implement independently.

public class InventoryCompensationHandler : IMessageHandler<ReleaseReservationRequest>
{
    private readonly IInventoryRepository _repository;
    private readonly ICompensationLog _compensationLog;
    private readonly ILogger<InventoryCompensationHandler> _logger;

    public InventoryCompensationHandler(
        IInventoryRepository repository,
        ICompensationLog compensationLog,
        ILogger<InventoryCompensationHandler> logger)
    {
        _repository = repository;
        _compensationLog = compensationLog;
        _logger = logger;
    }

    public async Task HandleAsync(ReleaseReservationRequest request, CancellationToken ct)
    {
        // Idempotency check: has this compensation already been applied?
        var alreadyApplied = await _compensationLog.ExistsAsync(
            sagaId: request.SagaId,
            stepName: "ReleaseReservation",
            ct: ct);

        if (alreadyApplied)
        {
            _logger.LogWarning(
                "Duplicate compensation message for saga {SagaId}, step ReleaseReservation. Ignoring.",
                request.SagaId);
            return; // Safe no-op — idempotent
        }

        var reservation = await _repository.GetReservationAsync(request.ReservationId, ct);

        if (reservation == null)
        {
            // Reservation doesn't exist — already released or never created. Record and exit.
            _logger.LogWarning(
                "Reservation {ReservationId} not found during compensation for saga {SagaId}",
                request.ReservationId, request.SagaId);
            await _compensationLog.RecordAsync(request.SagaId, "ReleaseReservation", ct);
            return;
        }

        await _repository.ReleaseReservationAsync(reservation.Id, ct);

        // Record compensation AFTER applying it — not before (at-least-once delivery means we
        // must tolerate the scenario where we apply but fail to record, requiring a retry)
        await _compensationLog.RecordAsync(request.SagaId, "ReleaseReservation", ct);

        _logger.LogInformation(
            "Reservation {ReservationId} released for saga {SagaId}",
            request.ReservationId, request.SagaId);
    }
}

Further Reading