Part 3: Tracing and Chaos

In the last two posts, we verified that our system can handle paused services (Part 1) and dead message brokers (Part 2). But resilience isn't just about surviving; it's about understanding what is happening when things go wrong.

The Architecture Recap

Our system has grown complex. A single user request now traverses multiple hops.

When a request passes through three different services and two queues, looking at logs is painful. You have to grep through different files, trying to match timestamps.

This is where Distributed Tracing comes in.

Experiment 1: Following the Breadcrumbs

The sandbox comes pre-configured with OpenTelemetry and Tempo. Every time the Sensor Sim generates a message, it attaches a unique trace_id. That ID is passed along to every service in the chain.
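To make "the ID is passed along" concrete, here is a minimal standard-library sketch of how a trace ID can travel in a message header. It uses the W3C Trace Context `traceparent` layout, which OpenTelemetry's default propagator emits. In the sandbox this work is done by the OpenTelemetry SDK, so the message shape and helper names below are hypothetical and only illustrate the idea.

```python
import re
import secrets

# W3C Trace Context "traceparent" layout: version-traceid-spanid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: random 16-byte trace ID, random 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def child_traceparent(parent: str) -> str:
    """Keep the caller's trace ID but mint a fresh span ID for this hop."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(traceparent: str) -> str:
    """Extract the 32-hex-character trace ID from a traceparent value."""
    match = TRACEPARENT_RE.match(traceparent)
    if match is None:
        raise ValueError(f"malformed traceparent: {traceparent!r}")
    return match.group(1)


# Hypothetical message, shaped the way a simulator might publish it.
message = {"headers": {"traceparent": new_traceparent()}, "body": {"temp_c": 21.5}}

# A downstream hop forwards the body under the same trace ID, new span ID.
forwarded = {
    "headers": {"traceparent": child_traceparent(message["headers"]["traceparent"])},
    "body": message["body"],
}
```

Because every hop reuses the trace ID and only changes the span ID, a tracing backend can later stitch the individual spans back into one request.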

Step 1: Get a Trace ID

Let’s grab a real ID from the logs. You can use the Docker Dashboard to view the logs of sensor_sim, or run:

docker compose logs sensor_sim | grep "Published sensor reading" | tail -n 1

Copy the trace ID from that log line.
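If you would rather not copy by hand, a small script can pull the ID out of the log line. This is a sketch under an assumption: that the trace ID is printed as 32 lowercase hex characters, the usual OpenTelemetry rendering. The sample line is made up for illustration, so check it against what your sensor_sim actually logs.

```python
import re
from typing import Optional

# Assumption: the trace ID is logged as 32 lowercase hex characters.
TRACE_ID_RE = re.compile(r"\b[0-9a-f]{32}\b")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the first 32-hex-character token in the line, or None."""
    match = TRACE_ID_RE.search(log_line)
    return match.group(0) if match else None


# Hypothetical log line; the real format in the sandbox may differ.
sample = (
    "sensor_sim-1  | INFO Published sensor reading "
    "trace_id=4bf92f3577b34da6a3ce929d0e0e4736"
)
print(extract_trace_id(sample))
```

You could pipe the output of the docker compose command into a script like this instead of using the hard-coded sample.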

Step 2: Visualize the Journey

Go to Grafana, click "Explore," and select Tempo as the datasource. Paste your ID into the query bar.

You’ll see a timeline of spans, similar to a "Flame Graph." This visualizes the exact path that one request took.

The long bar at the top is the root span: the total time for the whole request.
The smaller bars below are child spans. They show how long the Ingestion service took, how long the database write took, and how long the worker took to process it.

If a request is slow, you don't have to guess why. You can look at the trace and see exactly which bar is the longest.
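The same "find the longest bar" reasoning can be written down in a few lines. The sketch below uses a simplified, hypothetical span structure and invented durations. It is not Tempo's data model, just the idea of ranking child spans by duration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    duration_ms: float
    parent: Optional[str] = None  # None marks the root span


def slowest_child(spans: List[Span]) -> Span:
    """Return the non-root span with the longest duration."""
    children = [span for span in spans if span.parent is not None]
    return max(children, key=lambda span: span.duration_ms)


# Invented numbers for illustration only.
trace = [
    Span("request", 640.0),
    Span("ingestion", 40.0, parent="request"),
    Span("db_write", 520.0, parent="request"),
    Span("worker", 80.0, parent="request"),
]
print(slowest_child(trace).name)  # db_write
```

In a real trace, spans nest several levels deep, so you would usually repeat this step: find the slowest child, then look at its children in turn.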

Experiment 2: Injecting Chaos

Finally, let’s have some fun. We know how the system behaves when components stop. But how does it behave when the network gets "flaky"?

Included in the stack is Toxiproxy, a tool specifically designed to simulate bad network conditions. It sits in front of our database and broker as a TCP proxy, so we can degrade those connections on demand.

The Scenario: The Slow Database

Let’s simulate a network issue that adds 500 ms of latency to every database call. Run the helper script included in the repo:

./scripts/chaos/inject-latency.sh postgres 500
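The script's contents aren't shown here, but a helper like this typically wraps a call to Toxiproxy's HTTP API, which adds a "latency" toxic to a named proxy. The sketch below builds that request with the standard library. The API address (Toxiproxy's default port 8474), the proxy name "postgres", and the toxic name are assumptions, so adjust them to match the sandbox's configuration.

```python
import json
import urllib.request

# Assumption: Toxiproxy's API is reachable on its default port.
TOXIPROXY_API = "http://localhost:8474"


def latency_toxic_request(
    proxy: str, latency_ms: int, jitter_ms: int = 0
) -> urllib.request.Request:
    """Build the POST request that adds a latency toxic to a proxy."""
    payload = {
        "name": f"{proxy}_latency",  # hypothetical toxic name
        "type": "latency",
        "stream": "downstream",  # delay data flowing back to the client
        "toxicity": 1.0,  # apply to every connection
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }
    return urllib.request.Request(
        url=f"{TOXIPROXY_API}/proxies/{proxy}/toxics",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def inject_latency(proxy: str, latency_ms: int) -> dict:
    """Send the request; requires a running Toxiproxy with that proxy defined."""
    with urllib.request.urlopen(latency_toxic_request(proxy, latency_ms)) as response:
        return json.load(response)
```

Calling `inject_latency("postgres", 500)` against a running stack would mirror the command above. To undo it, the toxic can be removed with a DELETE request to `/proxies/postgres/toxics/postgres_latency`.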

The Observation

Go back to your Grafana dashboard.

Latency Spikes: You’ll see database transaction times go up by roughly the injected delay.
Backpressure Returns: Remember Part 1? Because the database is slow, the Ingestion service can't process messages as fast, so they start to pile up behind it.
The CPU Change: Check the Docker Dashboard. You might see the CPU usage of the Ingestion service drop slightly, because it now spends more of its time waiting on I/O.

This shows the ripple effect. A slow database causes a slow API, which in turn causes the queue to back up.
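A little arithmetic shows why the queue backs up so quickly. The numbers below are hypothetical: 10 messages per second arriving, a 20 ms baseline database write, and a single consumer that handles one message at a time. The sandbox's real rates will differ, but the shape of the result should not.

```python
def max_throughput(per_message_s: float, concurrency: int = 1) -> float:
    """Messages per second a consumer can sustain at a given per-message cost."""
    return concurrency / per_message_s


def backlog_after(seconds: float, arrival_rate: float, service_rate: float) -> float:
    """Queue depth after `seconds`, starting from an empty queue."""
    return max(0.0, (arrival_rate - service_rate) * seconds)


ARRIVAL_RATE = 10.0  # msgs/s, hypothetical
BASE_WRITE_S = 0.02  # 20 ms baseline write, hypothetical
INJECTED_S = 0.5     # the 500 ms added by the latency toxic

healthy = max_throughput(BASE_WRITE_S)
degraded = max_throughput(BASE_WRITE_S + INJECTED_S)

print(round(backlog_after(60, ARRIVAL_RATE, healthy)))   # 0
print(round(backlog_after(60, ARRIVAL_RATE, degraded)))  # 485
```

Under these assumptions the consumer drops from about 50 messages per second to under 2, so the arrival rate now exceeds capacity and the backlog grows every second until the latency is removed or more consumers are added.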

Outside Reading

Focus: The "Async Processing" subgraph. How we reliably take events from the Outbox and ensure they are processed by workers, even under heavy load.

Diagram Scope: Dispatcher (Publishing) → RabbitMQ (Work Queue) → Worker Service

Topics to Investigate:

Reliable Publishing: How does the Dispatcher ensure RabbitMQ actually received the message before it marks the Outbox record as "done"? (Publisher Confirms).
The Work Queue Pattern (Competing Consumers): How RabbitMQ distributes tasks across multiple instances of the Worker Service to scale processing power.
Consumer Prefetch: Tuning RabbitMQ so one worker doesn't hog all the messages while others sit idle (load balancing at the consumer level).
Handling Failures in Workers:
- Idempotency: The most critical concept for workers. Ensuring that processing the same message twice doesn't corrupt data (e.g., if a worker crashes halfway through).
- Dead Letter Exchanges (DLX): Where messages go to die (and be inspected later) after too many failed processing attempts.
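As a starting point for the failure-handling topics, here is a minimal in-memory sketch of an idempotent worker with a dead-letter path. Everything in it is hypothetical: the message shape, the `attempts` counter, and the retry budget. A real worker would record processed IDs in the database, in the same transaction as the side effect, and would let RabbitMQ route rejected messages via the queue's `x-dead-letter-exchange` argument rather than keeping a Python list.

```python
from dataclasses import dataclass, field

MAX_ATTEMPTS = 3  # hypothetical retry budget


@dataclass
class Worker:
    processed_ids: set = field(default_factory=set)
    totals: dict = field(default_factory=dict)
    dead_letters: list = field(default_factory=list)

    def handle(self, message: dict) -> str:
        msg_id = message["id"]
        if msg_id in self.processed_ids:
            # Redelivery of work already applied: acknowledge, do nothing.
            return "duplicate"
        if message.get("attempts", 0) >= MAX_ATTEMPTS:
            # Too many failed attempts: park the message for inspection.
            self.dead_letters.append(message)
            return "dead-lettered"
        # The side effect and the "seen" marker belong together; in a real
        # system they should commit atomically.
        sensor = message["sensor"]
        self.totals[sensor] = self.totals.get(sensor, 0) + message["value"]
        self.processed_ids.add(msg_id)
        return "processed"


worker = Worker()
reading = {"id": "m-1", "sensor": "s1", "value": 5}
print(worker.handle(reading))  # processed
print(worker.handle(reading))  # duplicate
print(worker.totals["s1"])     # 5
```

The second delivery of the same message leaves the total unchanged, which is the property that makes retries and redeliveries safe.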

Play Around

Treat this little system as a playground. You can tear down individual parts of it, or stop the entire thing and restart exactly where you left off.

Conclusion

We’ve now toured the full loop of distributed systems engineering:

Architecture: Designing for resilience (Queues, Outbox).
Observability: Visualizing the health (Grafana, Tracing).
Chaos: Proactively breaking things to verify our assumptions.

This repository is yours to play with. Try pausing the workers. Try adding jitter to the network. The more you break it in the sandbox, the less scary it will be when it happens in production.

https://github.com/inchoate/distr-system