AI-Assisted Engineering Harness
The model is the engine. The harness is the rails.
This chapter documents the small instrumentation layer Atmosphere uses to keep its own engineering loop honest under heavy AI-assisted contribution. It exists because we burned ourselves enough times shipping prose that disagreed with code (capability counts off by 3, runtime lists missing adopters, “PENDING” features that shipped weeks ago) that we eventually turned the catch-and-fix protocol into running code. The pattern is small and reusable; if you maintain a project that AI agents contribute to, the shape here transfers.
The framing is Justin Reock’s AI-Assisted Engineering (InfoQ, 2026-05): the orgs that get +20% from AI have an instrumented feedback loop on claim quality; the orgs that get −20% don’t. Utilization metrics (“% of code AI-authored”, “AI-assisted PR count”) trigger Goodhart’s Law and lose validity once they become targets. The right impact metric is change failure rate by agent claim.
The directory shape
Everything lives under .harness/ at the repo root, plus a few scripts
and a Claude Code hook. Anyone seeing the directory knows it’s
project-engineering plumbing, not runtime code:
    .harness/
    ├── README.md                      Operator manual for this directory
    ├── capabilities.snapshot.json     Canonical capability matrix snapshot
    └── drift-log.md                   Append-only record of caught hallucinations

    scripts/
    ├── regen-capability-snapshot.sh   Re-derive snapshot from source
    ├── validate-capability-claims.sh  Pre-push gate: prose ↔ snapshot agreement
    └── validate-drift-log.sh          Pre-push gate: append-only structural hygiene

    modules/ai-test/.../CapabilitySnapshotTest.java   JUnit mirror of the bash validator

    .claude/
    ├── hooks/check-drift-log.sh       Stop hook: block session-end on undocumented drift
    └── settings.json                  Project-level Claude Code hook registration

Capability snapshot: pin prose against running code
AiCapability is a 20-entry Java enum
(source).
Each of the 11 framework runtimes overrides
AbstractAgentRuntimeContractTest.expectedCapabilities() to declare its
exact subset, and the contract test asserts the runtime’s live
capabilities() method returns the same set. That’s the existing per-runtime
gate — it catches code drift but doesn’t catch prose drift in the
README’s count claims.
The snapshot closes that gap. scripts/regen-capability-snapshot.sh
parses AiCapability.java and every *RuntimeContractTest.{java,kt}
file, then writes a deterministic JSON aggregate to
.harness/capabilities.snapshot.json:
    {
      "schema_version": 1,
      "capabilities": {
        "count": 20,
        "names": ["AGENT_ORCHESTRATION", "AUDIO", "BUDGET_ENFORCEMENT", ...]
      },
      "runtimes": {
        "count": 9,
        "items": [
          {
            "name": "AdkAgentRuntime",
            "module": "modules/adk",
            "language": "java",
            "expected_capabilities": ["AGENT_ORCHESTRATION", ...]
          },
          ...
        ]
      }
    }

Two enforcement points consume it:
- scripts/validate-capability-claims.sh: wired into pre-push Tier 1. Greps modules/ai/README.md for tight count patterns (\bAll \d+ runtimes?\b and similar) and asserts each match equals the snapshot count.
- CapabilitySnapshotTest in modules/ai-test: same logic in pure Java, so mvn test catches the same drift.
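The prose-versus-snapshot check can be sketched in a few lines of shell. This is a minimal sketch under stated assumptions, not the actual validate-capability-claims.sh: the function name check_runtime_claims is hypothetical, only the one "All N runtimes" pattern is covered, and the expected count is passed in as an argument instead of being read from the snapshot file.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the real validate-capability-claims.sh.

# check_runtime_claims README EXPECTED
# Returns 1 (and reports on stderr) if any "All N runtime(s)" claim in
# README disagrees with EXPECTED; returns 0 otherwise.
check_runtime_claims() {
  local readme="$1" expected="$2" status=0 claims claim n
  # -o prints each match on its own line; -w stands in for the \b anchors.
  claims=$(grep -owE 'All [0-9]+ runtimes?' "$readme" || true)
  while IFS= read -r claim; do
    [ -n "$claim" ] || continue               # no claims found: nothing to check
    n=$(printf '%s' "$claim" | tr -cd '0-9')  # keep only the digits of the match
    if [ "$n" -ne "$expected" ]; then
      echo "drift: \"$claim\" but snapshot says $expected" >&2
      status=1
    fi
  done <<EOF
$claims
EOF
  return "$status"
}
```

A gate built on this would be called as check_runtime_claims modules/ai/README.md 9, with the second argument taken from the snapshot's runtimes.count field. The loop reads from a here-document, not a pipe, so the status variable survives to the return statement.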
The snapshot itself is committed; PR reviewers see “9 → 10 runtimes” as a
diff hunk without grepping. The LC_ALL=C shell forcing in the regen
script ensures bash sort matches Java’s String.compareTo so the JSON
ordering is identical to the JUnit test’s TreeSet<String> view.
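The ordering point is easy to demonstrate. Under LC_ALL=C, sort compares raw bytes, which for ASCII identifiers is the same order String.compareTo produces; many UTF-8 locales instead give underscores little or no weight, so names such as A_B and AB can swap places between hosts. The helper below is a hypothetical sketch (sorted_json_array is not a name from the regen script) and assumes the names are plain enum identifiers that need no JSON escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one name per line on stdin and print a JSON
# array sorted in byte order, with duplicates removed.
sorted_json_array() {
  LC_ALL=C sort -u | awk '
    BEGIN { printf "[" }
    NF    { printf "%s\"%s\"", (n++ ? ", " : ""), $0 }  # skip blank lines
    END   { print "]" }'
}
```

Forcing the locale inside the function, not in the caller's environment, keeps the output identical regardless of which machine regenerates the snapshot.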
This is structurally the same pattern
caveman’s evals/snapshots/results.json
uses for token-compression numbers — commit the snapshot to git so CI is
deterministic and free, and any change is reviewable as a diff.
Drift log — record the rate, not just incidents
.harness/drift-log.md is append-only. Every time a Claude session
catches itself (or gets caught) saying something that disagrees with the
code, the agent adds a structured row:
| # | Claim | Truth | Slip path | Gate added |
|---|---|---|---|---|
| N | what was stated | what the code says | how it bypassed existing gates | the regression-class fix (validator, test, memory update, prose grep) — none is a legitimate value |
Bundling log update + gate addition + prose fix in one commit makes each session’s impact diff-reviewable. Per Reock, the signal is the rate of entries over time, not the cleanliness of any single one. Don’t gatekeep; better to over-record minor drift than under-record it.
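Since the signal is the rate, a small script can turn the log into a per-day count. The sketch below assumes the log is a sequence of ## YYYY-MM-DD sections, each holding table rows in the five-column format shown above with a number in the first cell. The function name drift_rate is hypothetical and is not part of the scripts listed in this chapter.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print "DATE COUNT" for each dated section of a
# drift log, in file order.
drift_rate() {
  awk '
    # A dated heading starts a new section.
    /^## [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ {
      date = $2; order[++n] = date; next
    }
    # Count rows whose first cell is a number; header and separator rows
    # do not match.
    date != "" && /^[|] *[0-9]+ *[|]/ { count[date]++ }
    END {
      for (i = 1; i <= n; i++) printf "%s %d\n", order[i], count[order[i]]
    }
  ' "$1"
}
```

Running drift_rate .harness/drift-log.md on a schedule and plotting the second column gives the curve over time without anyone grading individual entries.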
The first 10 entries (seeded the day the log was created) record actual
session events: a memory file claimed “1 Quarkus build step” when the
code had 14; “PENDING” features that had shipped weeks earlier;
off-by-one runtime counts in narrative prose. The 11th entry recorded a
CI-caught regression where a wall-clock test asserted
observed > limit but our scheduled-task fix made observed == limit a
legitimate trip outcome. That entry’s gate column reads “JDK 21/26 CI
matrix caught it within 12 min” — which is the most honest gate value
of all: an existing gate worked.
Two enforcement points for the drift log
The log is structurally append-only. Two layers keep it that way and keep it populated:
scripts/validate-drift-log.sh — pre-push Tier 1. Asserts:
- File exists and parses.
- At least one ## YYYY-MM-DD section.
- No future-dated sections.
- Sections in chronological order (oldest top, newest bottom).
- Pre-existing sections (older than today) match origin/main verbatim.
It does not enforce that drift gets added — that’s the next layer’s job.
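The date checks in that list reduce to string comparisons, because ISO dates sort lexicographically. The following is a minimal sketch of three of the assertions (at least one section, none in the future, chronological order); it is not the actual validate-drift-log.sh. The function name check_section_dates and the optional second argument for overriding today's date are assumptions made for illustration, and the comparison against origin/main is left out because it needs a git checkout.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the structural date checks.

# check_section_dates FILE [TODAY]
# Returns non-zero (with a message on stderr) if FILE has no dated
# section, a future-dated section, or sections out of order.
check_section_dates() {
  local file="$1" today="${2:-$(date +%Y-%m-%d)}"
  awk -v today="$today" '
    /^## [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ {
      d = $2; seen++
      # ISO dates compare correctly as plain strings.
      if (d > today) { print "future-dated section: " d; bad = 1 }
      if (prev != "" && d < prev) { print "out of order: " d " after " prev; bad = 1 }
      prev = d
    }
    END {
      if (!seen) { print "no dated sections"; bad = 1 }
      exit bad
    }
  ' "$file" >&2
}
```

Passing the date explicitly, as in check_section_dates .harness/drift-log.md 2026-05-10, makes the check reproducible in tests; omitting it uses the host clock.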
Claude Code Stop hook at .claude/hooks/check-drift-log.sh,
registered in .claude/settings.json. Fires at session end:
- Reads transcript path from hook input JSON.
- Greps for high-precision drift-correction patterns: stale memory, \boff-by-one\b, I (was wrong|claimed)…(but|actual|truth), memor… was/is wrong/stale/out of date, fabricated rule/stat/count/claim, verified by grep…disagree/contradict/wrong/stale.
- If matched and .harness/drift-log.md was not modified this session (working tree, untracked, or last 3 commits), emits {"decision": "block", "reason": "..."} to force the agent to either append an entry or explicitly state the correction was trivial.
- stop_hook_active=true short-circuits to no-op so deliberate skips don’t loop.
Patterns are deliberately narrow to minimize false positives. If a recurring real correction shape isn’t matching, add a new pattern with concrete real-session evidence — don’t loosen existing ones.
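The decision logic of the Stop hook can be sketched separately from its input handling. This is a simplified illustration, not the contents of check-drift-log.sh: the function names are hypothetical, only the three fully spelled-out patterns are used, case-insensitive matching is an assumption, and the caller is assumed to have already extracted the transcript path and stop_hook_active from the hook's input JSON and to have worked out whether the log was touched.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the Stop hook decision, with input parsing omitted.

# drift_language_present TRANSCRIPT
# Succeeds if any of a few narrow drift-correction patterns matches.
drift_language_present() {
  grep -Eiq 'stale memory|off-by-one|fabricated (rule|stat|count|claim)' "$1"
}

# stop_decision TRANSCRIPT LOG_TOUCHED STOP_HOOK_ACTIVE
# Prints a block decision on stdout, or nothing when the session may end.
stop_decision() {
  local transcript="$1" log_touched="$2" active="$3"
  [ "$active" = "true" ] && return 0       # already re-engaged once: no-op
  [ "$log_touched" = "yes" ] && return 0   # drift was logged: nothing to do
  if drift_language_present "$transcript"; then
    printf '%s\n' '{"decision": "block", "reason": "Drift-correction language found but .harness/drift-log.md was not updated."}'
  fi
}
```

Checking stop_hook_active first is what keeps a deliberate "trivial, not worth logging" answer from triggering a second block.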
What this looks like in practice
A typical session might go:
- Claude claims “X is shipped” based on a 30-day-old memory file.
- ChefFamille (or git grep self-catch) says “verified by grep: that class doesn’t exist on main”.
- Claude reads the actual source, confirms the drift.
- Claude appends an entry to .harness/drift-log.md documenting the claim, truth, slip path, and what gate was added.
- Claude bundles the log entry + any prose fix + the gate (e.g., a regex pattern in validate-capability-claims.sh) into one commit.
- Pre-push Tier 1 runs both validators in <1s; commit lands.
- At session end the Stop hook checks the transcript: drift language present, log file modified, no block.
Without the hook, session 2 of the same day forgets and makes the same
class of claim again. With the hook, the agent is re-engaged before the
session can end, and either logs or explicitly states “trivial — not
worth logging” (the hook then no-ops via stop_hook_active).
What this is not
- Not a replacement for code review. The validators only check prose-vs-snapshot agreement and structural hygiene. They don’t catch semantic bugs, performance regressions, or architectural mistakes.
- Not a utilization metric. We don’t count “% of commits AI-authored” or “tokens spent per feature”. Those measures invite Goodhart’s Law.
- Not a substitute for verification at session start. The feedback_drift_log.md memory rule says: re-verify against current code before quoting any memory file older than the most recent CHANGELOG bump. The drift log records what slipped past that rule; the rule itself is the primary defense.
Adopting the pattern in your project
The shape is small enough to copy. Concretely, for a project with an LLM-facing agent integration:
- Pick one or two count claims you make in your README that have gone wrong before. Runtime count, capability count, sample count, backend count — anything quantitative that you’ve shipped wrong.
- Build a snapshot parsed from canonical source. JSON, committed to git, regenerated by a single shell script. Add LC_ALL=C so sort is deterministic across hosts.
- Add one validator that greps your README for those count claims and asserts against the snapshot. Wire it into your pre-push hook.
- Add an append-only drift log with one row per caught hallucination. Don’t stress about the schema: claim, truth, slip path, gate is enough.
- Add a Claude Code Stop hook (or your agent runtime’s equivalent) that greps the transcript for drift-correction language and blocks session end if the log wasn’t updated. Use narrow patterns; broad patterns cause false-positive loops.
That’s the whole pattern. Roughly 500 lines of bash + 250 lines of Java in our case. Lower bound for any project: the snapshot + one validator, maybe 100 lines, gives you the diff-reviewable curve.
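For the wiring step, a pre-push hook only needs to run each gate in order and stop at the first failure. The helper below is a hypothetical sketch (run_gates is not a name from Atmosphere's scripts) of how a .git/hooks/pre-push file might do that.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run each gate command in order and stop at the
# first one that fails, so git aborts the push.
run_gates() {
  local gate
  for gate in "$@"; do
    if ! "$gate"; then
      echo "pre-push gate failed: $gate" >&2
      return 1
    fi
  done
}
```

A hook file would end with a call such as run_gates scripts/validate-capability-claims.sh scripts/validate-drift-log.sh, and git cancels the push whenever the hook exits non-zero.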
Further reading
- Justin Reock, AI-Assisted Engineering: InfoQ talk, 2026-05. The DX measurement framework (utilization vs. impact vs. cost) and the Goodhart’s Law warning.
- walkinglabs/learn-harness-engineering: the five-subsystem framework (Instructions, State, Verification, Scope, Lifecycle). Treats the harness as engineering work rather than configuration.
- juliusbrussee/caveman: the snapshot-as-source-of-truth pattern with a three-arm baseline/control/treatment eval methodology. Inspired the diff-reviewable shape of capabilities.snapshot.json.
- Atmosphere’s .harness/README.md: operator manual for the directory.