
Orchestration vs Choreography: The Observability Argument

Written by Sagaweaw Team, Platform Engineering

Choreography is elegant on a whiteboard. Orchestration wins when you need to answer "what actually happened?" at 2am. The real reason Sagaweaw chose orchestration isn't control flow — it's observability. You can't replay what you can't observe.

What Choreography Means in Practice

In a choreography-based architecture, there is no central coordinator. Each service listens for events and reacts independently. Order service publishes OrderCreated. Inventory service picks it up, reserves stock, publishes StockReserved. Payment service picks that up, charges the card, publishes PaymentCharged. And so on.
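A minimal sketch of that event chain, assuming a hypothetical in-memory `EventBus` standing in for a real broker. The class, handler names, and payload shape are illustrative and are not part of Sagaweaw:

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Hypothetical in-memory broker; a real system would use a message broker."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.published: list[str] = []  # event names, in publish order

    def subscribe(self, event_name: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_name].append(handler)

    def publish(self, event_name: str, payload: dict) -> None:
        self.published.append(event_name)
        for handler in self._handlers[event_name]:
            handler(payload)


bus = EventBus()


# Each service knows only the event it consumes and the event it emits.
def inventory_service(event: dict) -> None:
    bus.publish("StockReserved", {"order_id": event["order_id"]})


def payment_service(event: dict) -> None:
    bus.publish("PaymentCharged", {"order_id": event["order_id"]})


bus.subscribe("OrderCreated", inventory_service)
bus.subscribe("StockReserved", payment_service)

bus.publish("OrderCreated", {"order_id": "12345"})
print(bus.published)
```

Nothing in this code describes the flow as a whole. The order of events emerges from the subscriptions, which is exactly what makes it hard to inspect later.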

It looks beautiful in a sequence diagram. Every service is decoupled. No one depends directly on another. The system feels like it breathes on its own.

The problem is what happens when it stops breathing.

Invisible Failures

When three services react to the same event and one fails silently, who knows? Who retries? Where is the state?

Consider this failure mode: OrderCreated is published. Inventory service processes it successfully. Payment service crashes mid-execution — the charge was attempted but no PaymentCharged event was ever published. Shipping service never heard a thing. The customer sees a confirmed order. Your warehouse has reserved stock. No money was collected.

In a choreography system, diagnosing this requires:

  1. Querying the inventory service logs for the order ID
  2. Querying the payment service logs for the same ID
  3. Querying whatever dead-letter queue your broker maintains
  4. Reconstructing a timeline manually from timestamps across 3+ systems with potentially skewed clocks
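Step 4 can be sketched as a merge of per-service logs. The log records, timestamps, and the 300 ms skew below are invented for illustration, and real log formats will differ:

```python
from datetime import datetime, timedelta

# Hypothetical log records: (timestamp, service, message).
order_logs = [
    (datetime(2024, 1, 9, 2, 0, 0, 0), "order", "OrderCreated order=12345"),
]
inventory_logs = [
    (datetime(2024, 1, 9, 2, 0, 0, 120_000), "inventory", "StockReserved order=12345"),
]
payment_logs = [
    # In this made-up scenario the payment host clock runs 300 ms behind.
    (datetime(2024, 1, 9, 2, 0, 0, 50_000), "payment", "charge attempted order=12345"),
]


def merge_timeline(*sources, skew=None):
    """Merge per-service logs into one list sorted by (optionally corrected) timestamp."""
    skew = skew or {}
    merged = [
        (ts + skew.get(service, timedelta(0)), service, msg)
        for source in sources
        for ts, service, msg in source
    ]
    return sorted(merged, key=lambda entry: entry[0])


# Without correction, the payment attempt appears to precede the stock reservation.
naive = [svc for _, svc, _ in merge_timeline(order_logs, inventory_logs, payment_logs)]

# The true order only appears once the skew is known and applied.
corrected = [
    svc
    for _, svc, _ in merge_timeline(
        order_logs,
        inventory_logs,
        payment_logs,
        skew={"payment": timedelta(milliseconds=300)},
    )
]

print(naive)
print(corrected)
```

The correction only works if you already know each host's skew. During an incident you usually do not, so the merged timeline is a guess.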

This is not a contrived edge case. This is Tuesday morning on-call.

The "Distributed Trace as a Lie" Problem

OpenTelemetry and distributed tracing are powerful. But they show spans, not intent.

A trace will tell you that POST /inventory was called, that it took 47ms, and that it returned 200. It will not tell you:

  • Which saga this call belonged to
  • What the expected next step was
  • Whether the outcome was correct for the business flow
  • What should happen if this step needs to be compensated

You can see the machinery. You cannot see the plan. When something goes wrong, you are reconstructing intent from execution artifacts — like reading a car crash report to understand where the driver intended to go.

The Orchestration Trade-off

Orchestration introduces a coordinator: one service (or library) that knows the full plan. Every step is explicit. Every transition is recorded. The state machine lives in one place.
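As a rough illustration of "every step is explicit, every transition is recorded", here is a minimal orchestrator sketch. It is not Sagaweaw's API: `Step`, `Saga`, the state names, and the example step functions are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]
    compensate: Callable[[dict], None]


@dataclass
class Saga:
    saga_id: str
    steps: list[Step]
    state: str = "PENDING"
    transitions: list[tuple[str, str]] = field(default_factory=list)

    def run(self, ctx: dict) -> str:
        completed: list[Step] = []
        self.state = "RUNNING"
        for step in self.steps:
            try:
                step.action(ctx)
            except Exception:
                self.transitions.append((step.name, "FAILED"))
                # Undo the steps that already completed, in reverse order.
                for done in reversed(completed):
                    done.compensate(ctx)
                    self.transitions.append((done.name, "COMPENSATED"))
                self.state = "COMPENSATED"
                return self.state
            self.transitions.append((step.name, "COMPLETED"))
            completed.append(step)
        self.state = "COMPLETED"
        return self.state


# Hypothetical step implementations for an order flow.
def reserve_stock(ctx: dict) -> None:
    ctx["stock_reserved"] = True


def release_stock(ctx: dict) -> None:
    ctx["stock_reserved"] = False


def charge_card(ctx: dict) -> None:
    raise RuntimeError("payment provider unavailable")


def refund_card(ctx: dict) -> None:
    ctx["charged"] = False


saga = Saga(
    "order-12345",
    [
        Step("reserve_stock", reserve_stock, release_stock),
        Step("charge_card", charge_card, refund_card),
    ],
)
ctx: dict = {}
saga.run(ctx)
print(saga.state, saga.transitions)
```

The plan, the current state, and the history all live on one object. A production orchestrator would persist the transitions durably rather than keep them in memory, and would need to handle failures inside compensation, which this sketch ignores.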

This is more coupling at the coordinator level. That's the honest cost. If the orchestrator has a bug, it affects all flows. If it's unavailable, flows can't progress.

But what you get in return is a single source of truth for state. At any moment, you can ask: "What is the current state of order #12345?" and get a definitive answer — not a reconstruction from event logs.

Why the Sagaweaw Dashboard Exists

The Sagaweaw dashboard is only possible because of orchestration.

When a saga runs, every step transition is persisted: which step executed, what its result was, when it started, when it ended, whether it was compensated. The dashboard reads this data and renders a timeline.

With choreography, you'd have to reconstruct that timeline from 5 different service logs, correlate them by order ID, handle clock skew, and hope nothing was silently dropped. That's not a dashboard — that's archaeology.

With orchestration, the question "what happened to this saga?" has a direct database query as its answer.
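To make "a direct database query" concrete, here is a sketch using SQLite. The `saga_steps` table, its columns, and the sample rows are assumptions for illustration only and do not describe Sagaweaw's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per step transition, with the fields the text lists.
conn.execute(
    """
    CREATE TABLE saga_steps (
        saga_id     TEXT NOT NULL,
        seq         INTEGER NOT NULL,
        step_name   TEXT NOT NULL,
        result      TEXT NOT NULL,
        started_at  TEXT NOT NULL,
        ended_at    TEXT,
        compensated INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (saga_id, seq)
    )
    """
)

# Invented sample data for a saga whose payment step failed.
rows = [
    ("order-12345", 1, "reserve_stock", "ok",
     "2024-01-09T02:00:00.000Z", "2024-01-09T02:00:00.047Z", 1),
    ("order-12345", 2, "charge_card", "error",
     "2024-01-09T02:00:00.050Z", "2024-01-09T02:00:00.310Z", 0),
]
conn.executemany("INSERT INTO saga_steps VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def saga_timeline(connection: sqlite3.Connection, saga_id: str) -> list[tuple]:
    """Return the ordered step history for one saga."""
    cursor = connection.execute(
        "SELECT step_name, result, compensated "
        "FROM saga_steps WHERE saga_id = ? ORDER BY seq",
        (saga_id,),
    )
    return cursor.fetchall()


timeline = saga_timeline(conn, "order-12345")
print(timeline)
```

Ordering by an explicit sequence number assigned by the coordinator, rather than by timestamps, avoids the clock-skew problem entirely.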

When Choreography Is Actually Right

Choreography is not wrong. It's wrong for this problem.

It excels when:

  • You have truly independent services with no shared business flow
  • Throughput is extremely high and a central coordinator would bottleneck
  • Events represent facts about the world, not steps in a transaction that can fail
  • You have no compensation requirement — things don't need to be undone

High-throughput event streaming, analytics pipelines, audit logs — these are natural fits for choreography. But for business flows with compensation logic (orders, payments, onboarding), the need to answer "what happened?" makes orchestration the right default.

You can't replay what you can't observe. And you can't observe what has no central state.
