Ten Failure Modes from a Year of Running an Agent Harness — Part 2


I'm a CTO and founder with nearly two decades of experience driving growth and transformation through technology. At Stronghold Investment Management, I led the development of a systematic real asset trading platform and modernized everything from Salesforce strategy to custom cloud-native infrastructure. My background spans commercial real estate, e-commerce, and private markets — always focused on delivering innovation, velocity, and meaningful business outcomes. I hold a PhD in Theoretical & Computational Biophysics and was recognized as a Google Developer Expert in Cloud. I build high-trust, high-output teams. I’ve rebuilt broken cultures, hired top-tier engineers, and helped early-stage and PE-backed companies scale with confidence. System modernization is my specialty — not just upgrading software, but aligning teams and infrastructure with what the business actually needs. Currently, I lead client engagements through Heavy Chain Engineering and am building Newroots.ai, an AI-driven relocation advisory platform.

This is the second of two articles cataloging failure modes I've observed running an agent harness in production. The first five, in Part 1, shared a theme: the system certified success when no real work had been done — green tests over a hollow build, gates that vouched for each other, dismissals that compounded.

The remaining five are different. These are patterns where the signals themselves go wrong: an agent saves itself thirty seconds and burns the operator's afternoon; a config field reviews clean and does nothing; a guardrail fires on a predictable false-positive case until everyone learns to ignore it; a review produces volume but no judgment; a prototype gets faithfully implemented as if it were a spec.

The common thread is misaligned interpretation — the agent reads a signal one way, reality means another, and the gap doesn't surface until something downstream goes wrong.

6. Local Optimization, Global Tax

Pattern: An agent "saves time" by bypassing a safety rail or grinding on a dead end — and the savings land on the agent's clock while the cost lands on the human's.

Scenario A. An agent disables a sandbox "to go faster," which surfaces an approval prompt to the operator on every subsequent action. The agent saved itself thirty seconds and spent the operator's attention all afternoon.

Scenario B. An agent, stuck on a failing test, generates dozens of variant attempts over the better part of two days without ever re-baselining its assumptions — burning a five-figure compute bill — until a human notices out of sheer impatience.

How it sneaks in. Agents optimize the objective in front of them (this turn, this test) and have no native feel for the operator's attention budget or for the meta-cost in wall-clock time and dollars. "Just trying to save time" is the honest, recurring confession.

The tell / cost. The agent's local metrics look fine while the global ones — operator interruptions, dollars, calendar days — quietly blow out. It's economically invisible inside the loop and obvious only from outside it.

Why the fix is hard. The leverage is giving the system a sense of the global cost function — operator attention and money as first-class budgets, with a bidirectional alarm when high-confidence effort yields low progress. Encoding "stop digging" without making the agent timid is the trick.
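One way to picture "attention and money as first-class budgets" is a small ledger the harness updates after every attempt. This is a minimal sketch, not a description of any particular harness: `GlobalBudget`, its fields, and the thresholds are all hypothetical, and "progress" is assumed to mean a drop in the number of failing tests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlobalBudget:
    """Costs that land outside the agent's own loop. All names are hypothetical."""
    max_dollars: float
    max_operator_prompts: int
    max_stalled_attempts: int
    dollars: float = 0.0
    operator_prompts: int = 0
    stalled_attempts: int = 0
    fewest_failures: Optional[int] = None

    def record_attempt(self, cost_dollars, failing_tests, prompted_operator=False):
        """Charge one attempt against the budgets and update the stall counter."""
        self.dollars += cost_dollars
        if prompted_operator:
            self.operator_prompts += 1
        # Assumption: progress means strictly fewer failing tests than ever before.
        if self.fewest_failures is None or failing_tests < self.fewest_failures:
            self.fewest_failures = failing_tests
            self.stalled_attempts = 0
        else:
            self.stalled_attempts += 1

    def reason_to_stop(self) -> Optional[str]:
        """Return why the agent should stop and escalate, or None to continue."""
        if self.dollars > self.max_dollars:
            return "dollar budget exceeded"
        if self.operator_prompts > self.max_operator_prompts:
            return "operator attention budget exceeded"
        if self.stalled_attempts >= self.max_stalled_attempts:
            return "no progress: re-baseline assumptions or escalate"
        return None


budget = GlobalBudget(max_dollars=50.0, max_operator_prompts=3, max_stalled_attempts=2)
for failing in (5, 5, 5):  # three attempts, same five failing tests each time
    budget.record_attempt(cost_dollars=1.0, failing_tests=failing)
print(budget.reason_to_stop())
```

The stall counter is the part that matters for Scenario B: it trips long before the dollar ceiling does, which is a cheap approximation of "stop digging." Setting the thresholds so the agent is not made timid remains the hard part and is not solved here.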

7. The Inert Flag (compiles clean, does nothing)

Pattern: A configuration change passes every check and looks done — but has no runtime effect whatsoever.

Scenario. Someone wants a skill to "run at higher effort" and adds an effort: field to its config. It compiles. It installs. It reviews clean. It is also completely inert: nothing in the runtime ever reads that field. The change is a no-op wearing the costume of a feature — and because the toolchain validated it, everyone believes the behavior changed.

How it sneaks in. Config schemas that silently pass through unknown keys; a plausible-sounding field name; the absence of any test that asserts the field actually does something. It's the cousin of failure mode #1 from Part 1 (structurally green, behaviorally null), but at the configuration layer, where it's even easier to miss.

The tell / cost. You "ship" the same setting three times because it never took. Trust in the config surface erodes; people start cargo-culting flags that may or may not be wired.

Why the fix is hard. The discipline is refusing to let a knob exist unless something demonstrably turns when you turn it — and knowing, at design time, which "small changes" are actually inert before you write them. That judgment is exactly what a forward-deployed engineer is paid for.
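The "something demonstrably turns" rule can be approximated with two small checks: a loader that rejects keys nothing is known to read, and a behavioral test that flips one knob and compares what the runtime does. The sketch below is illustrative only. `load_skill_config`, `knob_has_effect`, and both toy runtimes are hypothetical stand-ins, and the `effort` values are invented for the example.

```python
def load_skill_config(raw: dict, allowed_keys: set) -> dict:
    """Reject keys nothing is known to read, instead of passing them through."""
    unknown = sorted(set(raw) - allowed_keys)
    if unknown:
        raise ValueError(f"unknown config keys: {unknown}")
    return dict(raw)


def knob_has_effect(run, config: dict, key: str, value_a, value_b) -> bool:
    """True if changing only `key` changes what the runtime does."""
    return run({**config, key: value_a}) != run({**config, key: value_b})


def inert_runtime(config: dict) -> dict:
    # Hypothetical runtime that never reads `effort`: the flag is a no-op.
    return {"skill": config["name"], "max_steps": 10}


def wired_runtime(config: dict) -> dict:
    # Hypothetical runtime where `effort` changes observable behavior.
    steps = {"low": 5, "high": 20}[config.get("effort", "low")]
    return {"skill": config["name"], "max_steps": steps}


base = {"name": "summarize"}
print(knob_has_effect(inert_runtime, base, "effort", "low", "high"))
print(knob_has_effect(wired_runtime, base, "effort", "low", "high"))
```

The strict loader catches the field at install time; the behavioral check catches the subtler case where the key is accepted by the schema but still not wired to anything.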

8. Cry-Wolf Gates

Pattern: A guardrail keys on a signal that doesn't account for a legitimate variant, fires false positives, and trains everyone to ignore it.

Scenario. A coupling check is built to flag "scope changes that lack a decision record." But it keys on a comparison that only exists once a feature has a prior baseline. On fresh work — which has no baseline yet — it flags the feature's own normal content as a violation, every single time. After the third false alarm, the operator stops reading the gate's output. The one time it's right, nobody's listening.

How it sneaks in. Gates are written against the common case and never tested against the empty/legacy/first-run cases. A rule that's correct 90% of the time but noisy on a predictable 10% does more damage than no rule, because false positives are a tax on attention and a solvent for trust.

The tell / cost. Operators develop muscle memory for dismissing a specific warning. The gate still exists, but it is functionally off because nobody believes it anymore.

Why the fix is hard. The non-obvious requirement is that a gate must be correct on the boundary cases or silent on them, never confidently wrong, and that the false-positive rate is itself a first-class quality metric for the gate. Tuning that without punching holes in coverage is the work.
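"Correct or silent, never confidently wrong" suggests a gate with three outcomes rather than two. The following is a rough sketch under stated assumptions: scope is modeled as a set of item names, a missing baseline is represented as `None`, and `Verdict`, `scope_change_gate`, and `false_positive_rate` are hypothetical names rather than part of any real tool.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_APPLICABLE = "not_applicable"  # the gate stays silent


def scope_change_gate(baseline: Optional[set], current: set,
                      has_decision_record: bool) -> Verdict:
    """Flag scope changes that lack a decision record, but only when comparable."""
    if baseline is None:
        # Fresh work has no prior baseline, so there is nothing to compare against.
        return Verdict.NOT_APPLICABLE
    if current != baseline and not has_decision_record:
        return Verdict.FAIL
    return Verdict.PASS


def false_positive_rate(history) -> float:
    """history: (gate_fired, was_real_violation) pairs from reviewed runs.

    Standard definition: false alarms divided by all cases with no real violation.
    """
    negatives = [fired for fired, real in history if not real]
    if not negatives:
        return 0.0
    return sum(negatives) / len(negatives)


print(scope_change_gate(None, {"login", "signup"}, has_decision_record=False))
print(scope_change_gate({"login"}, {"login", "signup"}, has_decision_record=False))
```

Tracking `false_positive_rate` per gate needs someone to label whether each firing was a real violation, which is its own cost. The sketch assumes those labels exist.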

9. Surface Lint as Theater

Pattern: An automated reviewer reports formatting and naming nits, declares the review done, and never engages the architecture.

Scenario. An agent "reviews" a new service and comes back with a tidy list: line too long here, inconsistent import order there, rename this variable. It reads like diligence. What it never asked: does the data actually flow the way the design claims? Are the module boundaries real or is this a distributed monolith wearing microservice costumes? Is this "configuration" secretly code? The review engaged everything a linter already covers and nothing a linter can't.

How it sneaks in. Surface findings are easy to generate and feel productive; architectural judgment is hard and unfalsifiable in the moment. An agent optimizing for "produce review output" will produce the cheap kind. The masquerade — what the system claims to be vs. what it does — goes unexamined.

The tell / cost. Reviews that are voluminous and weightless. The real structural debt ships untouched while everyone admires the clean import blocks.

Why the fix is hard. The leverage is drawing a hard line: the machine owns surface lint, and the reviewer's entire value is the structural judgment the machine can't make — boundaries, coupling, the gap between claim and behavior. Forcing a reviewer up to that altitude, reliably, is the differentiator.
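One crude way to enforce that line is to refuse a review that has not recorded an explicit answer to each structural question, even if the answer is "checked, no issue." The sketch below assumes a review is a dictionary of answers keyed by question; the question list and the `unanswered_structural_questions` helper are hypothetical. It checks only that the questions were engaged, not that the answers are any good.

```python
# Hypothetical list of questions a linter cannot answer.
STRUCTURAL_QUESTIONS = ("data-flow", "boundaries", "coupling", "claim-vs-behavior")


def unanswered_structural_questions(answers: dict) -> list:
    """Return the structural questions the review left blank or skipped."""
    return [q for q in STRUCTURAL_QUESTIONS if not answers.get(q, "").strip()]


def accept_review(answers: dict) -> bool:
    """Accept only reviews that engaged every structural question.

    Surface nits are deliberately not counted: the linter owns those.
    """
    return not unanswered_structural_questions(answers)


lint_only_review = {"naming": "rename tmp to buffer", "import-order": "sort imports"}
print(unanswered_structural_questions(lint_only_review))
print(accept_review(lint_only_review))
```

A checklist like this can itself be gamed with boilerplate answers, so it raises the floor without supplying the judgment.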

10. The Prototype Mistaken for the Spec

Pattern: A throwaway prototype's incidental choices get read as requirements and faithfully implemented.

Scenario. A designer hacks together a clickable prototype to show a flow. It has placeholder copy, a hardcoded demo dataset, and a happy-path-only journey because that's all a demo needs. The harness ingests it as intent. Now the placeholder copy is in production strings, the demo dataset's quirks are encoded as business rules, and there are no error states anywhere — because the prototype, built to impress, never had any.

How it sneaks in. A prototype and a specification are both "the artifact in front of the agent," and nothing distinguishes illustrative from required. Fidelity in the wrong dimension — making the demo look real — actively misleads.

The tell / cost. Features that are pixel-faithful to a demo and semantically hollow: no edge cases, no failure handling, copy that was never meant to ship. The non-happy-path simply doesn't exist, because the source material didn't model it.

Why the fix is hard. The discipline is reading a prototype as a question, not an answer — separating "this is required" from "this is one way it could look" before a line is written. Encoding that distinction so agents honor it is the part worth paying for.
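The "required versus one way it could look" distinction can be made explicit by tagging every element of a prototype before anything is built, and refusing to derive requirements while any element is still undecided. This is a minimal sketch with hypothetical names (`Status`, `PrototypeElement`, `requirements`); how a real harness would extract elements from a prototype is out of scope.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    REQUIRED = "required"
    ILLUSTRATIVE = "illustrative"
    UNDECIDED = "undecided"


@dataclass(frozen=True)
class PrototypeElement:
    name: str
    status: Status = Status.UNDECIDED  # default: a question, not an answer


def open_questions(elements) -> list:
    """Elements nobody has classified yet."""
    return [e.name for e in elements if e.status is Status.UNDECIDED]


def requirements(elements) -> list:
    """Return required elements, refusing to guess about undecided ones."""
    undecided = open_questions(elements)
    if undecided:
        raise ValueError(f"classify before implementing: {undecided}")
    return [e.name for e in elements if e.status is Status.REQUIRED]


prototype = [
    PrototypeElement("checkout flow", Status.REQUIRED),
    PrototypeElement("placeholder copy", Status.ILLUSTRATIVE),
    PrototypeElement("demo dataset"),  # still undecided
]
print(open_questions(prototype))
```

Defaulting to `UNDECIDED` is the design choice that matters: nothing from the prototype becomes a requirement by accident. Note that this only classifies what the prototype contains; missing error states still have to be added by someone who knows to ask for them.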

Ten patterns, two articles. The thread running through all of them is that an agent harness, left to its own devices, will generate confident artifacts at every layer — code, tests, config, reviews, postmortems — and the artifacts will look correct in isolation. The work of building a harness that ships real software is mostly the work of making "looks correct" and "is correct" the same thing, at every seam where they could come apart.

About Heavy Chain

Heavy Chain works with engineering organizations to deploy AI-native PDLC and SDLC — the Etc framework, together with the tooling, gates, and review rituals that make it real — into existing teams. The harder half of the work is change management: helping the new lifecycle stick without breaking what already works. The patterns above are drawn from that work; if any of them feel familiar, that's where we come in.