Part 3: Tracing and Chaos
A System You Can Trust

In the last two posts, we verified that our system can handle paused services (Part 1) and dead message brokers (Part 2). But resilience isn't just about surviving; it's about understanding what is happening when things go wrong.
The Architecture Recap
Our system has grown complex. A single user request now traverses multiple hops:

When a request passes through three different services and two queues, looking at logs is painful. You have to grep through different files, trying to match timestamps.
This is where Distributed Tracing comes in.
Experiment 1: Following the Breadcrumbs
The sandbox comes pre-configured with OpenTelemetry and Tempo. Every time the Sensor Sim generates a message, it attaches a unique trace_id. That ID is passed along to every service in the chain.
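To make "passed along to every service" concrete, here is a minimal sketch of the idea using only the Python standard library. It follows the W3C Trace Context `traceparent` header format, which OpenTelemetry propagates by default. In the sandbox the OpenTelemetry SDK does this for you, so the function names below are purely illustrative and not taken from the repo.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: a fresh 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def trace_id_of(traceparent: str) -> str:
    """Pull the trace ID out of a traceparent header value."""
    match = TRACEPARENT_RE.match(traceparent)
    if match is None:
        raise ValueError(f"malformed traceparent: {traceparent!r}")
    return match.group(1)


def child_traceparent(parent: str) -> str:
    """Continue an existing trace: keep the trace ID, mint a new span ID."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# The first service starts the trace; every later hop derives its header
# from the one it received, so the trace ID stays the same end to end.
sensor_header = new_traceparent()
ingestion_header = child_traceparent(sensor_header)
worker_header = child_traceparent(ingestion_header)
```

Each hop gets its own span ID, but all three headers share one trace ID. That shared ID is what lets Tempo stitch the spans back into a single trace.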
Step 1: Get a Trace ID
Let’s grab a real ID from the logs. You can use the Docker Dashboard to view the logs of sensor_sim, or run:
docker compose logs sensor_sim | grep "Published sensor reading" | tail -1
Copy the ID you see there.
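If you would rather not copy the ID by hand, a small script can pull it out of the log line. This is a sketch under an assumption: the exact log format of sensor_sim is not shown in this post, so the sample line below is hypothetical. The only thing the code relies on is that an OpenTelemetry trace ID is normally printed as 32 lowercase hex characters.

```python
import re
from typing import Optional

# An OpenTelemetry trace ID is 16 bytes, usually rendered as 32 hex characters.
TRACE_ID_RE = re.compile(r"\b[0-9a-f]{32}\b")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the first 32-character hex token in the line, or None."""
    match = TRACE_ID_RE.search(log_line)
    return match.group(0) if match else None


# Hypothetical log line; the real sensor_sim output may be formatted differently.
sample_line = (
    "sensor_sim-1  | INFO Published sensor reading "
    "trace_id=4bf92f3577b34da6a3ce929d0e0e4736"
)
trace_id = extract_trace_id(sample_line)
```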
Step 2: Visualize the Journey
Go to Grafana, click "Explore," and select Tempo as the datasource. Paste your ID into the query bar.
You’ll see the trace drawn as a timeline of spans, often loosely called a "flame graph." It shows the exact path that one request took and how long each step lasted.
The long bar at the top is the total time.
The smaller bars below show how long the Ingestion service took, how long the database write took, and how long the worker took to process it.

If a request is slow, you don't have to guess why. You can look at the trace and see exactly which bar is the longest.
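"Find the longest bar" is easy to do by eye, but it helps to see what the comparison actually is. The sketch below uses made-up span data (the names and timings are hypothetical, not exported from the sandbox) and computes each span's self time: its own duration minus the time spent in its direct children. It assumes the children of a span run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Map span_id to its duration minus the durations of its direct children."""
    result = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id in result:
            result[s.parent_id] -= s.duration_ms
    return result


def slowest_span(spans: List[Span]) -> Span:
    """The span that spent the most time doing its own work."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# Hypothetical trace: the root span covers the whole request.
TRACE = [
    Span("a", None, "sensor reading", 0.0, 120.0),
    Span("b", "a", "ingestion", 5.0, 100.0),
    Span("c", "b", "db write", 10.0, 90.0),
    Span("d", "a", "worker", 100.0, 118.0),
]

culprit = slowest_span(TRACE)
```

In this example the root span is the longest bar overall, but almost all of its time is spent inside its children. Self time points at the database write instead, which is the kind of conclusion the trace view lets you reach at a glance.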
Experiment 2: Injecting Chaos
Finally, let’s have some fun. We know how the system behaves when components stop. But how does it behave when the network gets "flaky"?
Included in the stack is Toxiproxy, a tool specifically designed to simulate bad network conditions. It sits in front of our database and broker as a proxy whose connections we can deliberately degrade.
The Scenario: The Slow Database
Let's simulate a network issue that adds 500ms of latency to every database call. Run the helper script included in the repo:
./scripts/chaos/inject-latency.sh postgres 500
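The contents of the helper script are not shown here, but Toxiproxy is controlled through an HTTP API, so a script like this most likely adds a "latency" toxic to the named proxy. Here is a rough Python equivalent. The API address (localhost:8474 is Toxiproxy's default), the proxy name, and the toxic name are assumptions; check the repo's compose file for the real values.

```python
import json
import urllib.request

# Assumption: the Toxiproxy API is reachable on its default port.
TOXIPROXY_API = "http://localhost:8474"


def latency_toxic_request(
    proxy: str, latency_ms: int, jitter_ms: int = 0
) -> urllib.request.Request:
    """Build the POST request that adds a latency toxic to a proxy."""
    payload = {
        "name": f"{proxy}_latency",  # hypothetical toxic name
        "type": "latency",
        "stream": "downstream",  # delay data flowing back to the client
        "toxicity": 1.0,  # apply to every connection
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }
    return urllib.request.Request(
        url=f"{TOXIPROXY_API}/proxies/{proxy}/toxics",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def inject_latency(proxy: str, latency_ms: int) -> dict:
    """Send the request; needs a running Toxiproxy with the proxy defined."""
    request = latency_toxic_request(proxy, latency_ms)
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


# Building the request has no side effects, so it is safe to inspect it first.
request = latency_toxic_request("postgres", 500)
```

Calling `inject_latency("postgres", 500)` against a running stack would then have roughly the same effect as the shell command above, assuming the proxy is registered under the name `postgres`.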
The Observation
Go back to your Grafana dashboard.
Latency Spikes: You’ll see database transaction times go up by roughly the injected delay.
Backpressure Returns: Remember Part 1? Because the database is slow, the Ingestion Service can't process messages as fast, so unprocessed messages start to accumulate in the queue.
The CPU Change: Check the Docker Dashboard. You might see the CPU usage of the Ingestion service drop slightly (as it spends more time waiting on IO).
This shows the ripple effect. A slow database causes a slow API, which causes a backing-up queue.
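The arithmetic behind that ripple effect fits in a few lines. The sketch below is a toy model with made-up numbers (10 messages per second arriving, a single consumer, 20ms per message when healthy), not measurements from the sandbox. It assumes each message costs one database call, so the injected 500ms adds directly to the per-message service time.

```python
from typing import List


def queue_depth_over_time(
    arrivals_per_sec: float, service_ms: float, seconds: int
) -> List[int]:
    """Queue depth at the end of each second for one sequential consumer."""
    capacity_per_sec = 1000.0 / service_ms
    depth = 0.0
    history = []
    for _ in range(seconds):
        # The queue grows by whatever the consumer cannot keep up with.
        depth = max(0.0, depth + arrivals_per_sec - capacity_per_sec)
        history.append(round(depth))
    return history


# Healthy: 20ms per message means 50 msg/s of capacity for 10 msg/s of load.
healthy = queue_depth_over_time(arrivals_per_sec=10, service_ms=20, seconds=5)

# Degraded: 20ms + 500ms of injected latency means under 2 msg/s of capacity.
degraded = queue_depth_over_time(arrivals_per_sec=10, service_ms=520, seconds=5)
```

In the healthy case the queue stays empty. In the degraded case it grows by about eight messages every second and never recovers until the latency is removed, which is the backing-up queue you see on the dashboard.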
Outside Reading
Focus: The "Async Processing" subgraph. How we reliably take events from the Outbox and ensure they are processed by workers, even under heavy load.
Diagram Scope: Dispatcher (Publishing) → RabbitMQ (Work Queue) → Worker Service
Topics to Investigate:
Reliable Publishing: How does the Dispatcher ensure RabbitMQ actually received the message before it marks the Outbox record as "done"? (Publisher Confirms).
The Work Queue Pattern (Competing Consumers): How RabbitMQ distributes tasks across multiple instances of the Worker Service to scale processing power.
Consumer Prefetch: Tuning RabbitMQ so one worker doesn't hog all the messages while others sit idle (load balancing at the consumer level).
Handling Failures in Workers:
Idempotency: The most critical concept for workers. Ensuring that processing the same message twice doesn't corrupt data (e.g., if a worker crashes halfway through).
Dead Letter Exchanges (DLX): Where messages go to die (and be inspected later) after too many failed processing attempts.
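Of these topics, idempotency is the easiest to try in isolation. Below is a minimal sketch of one common approach: record each message ID in the same database transaction as the side effect, so a redelivered message is detected and skipped. It uses an in-memory SQLite database to stay self-contained. The table names and message shape are hypothetical and not taken from the Worker Service.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory database with a dedup table and a totals table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE totals (sensor TEXT PRIMARY KEY, total REAL NOT NULL)")
    return conn


def handle(conn: sqlite3.Connection, message_id: str, sensor: str, value: float) -> bool:
    """Apply a message at most once. Returns False for a duplicate."""
    try:
        # One transaction: the dedup record and the side effect commit together,
        # or neither does (for example if the worker crashes halfway through).
        with conn:
            conn.execute(
                "INSERT INTO processed (message_id) VALUES (?)", (message_id,)
            )
            updated = conn.execute(
                "UPDATE totals SET total = total + ? WHERE sensor = ?",
                (value, sensor),
            )
            if updated.rowcount == 0:
                conn.execute(
                    "INSERT INTO totals (sensor, total) VALUES (?, ?)",
                    (sensor, value),
                )
        return True
    except sqlite3.IntegrityError:
        # The primary key on message_id rejected a message we already handled.
        return False


conn = make_store()
first = handle(conn, "msg-1", "temperature", 2.5)
redelivered = handle(conn, "msg-1", "temperature", 2.5)  # same message again
total = conn.execute(
    "SELECT total FROM totals WHERE sensor = 'temperature'"
).fetchone()[0]
```

The second call is rejected by the primary key, so the total is only incremented once. In a real worker you would still acknowledge the duplicate message so the broker stops redelivering it.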
Play Around
Treat this little system as a playground. You can tear down individual parts of it, or stop the entire stack and restart exactly where you left off.
Conclusion
We’ve now toured the full loop of distributed systems engineering:
Architecture: Designing for resilience (Queues, Outbox).
Observability: Visualizing the health (Grafana, Tracing).
Chaos: Proactively breaking things to verify our assumptions.
This repository is yours to play with. Try pausing the workers. Try adding jitter to the network. The more you break it in the sandbox, the less scary it will be when it happens in production.





