
Estimate dxkit's impact on your agent loop.

Enter your cadence. We multiply it by measured benchmark rates to estimate the net-new debt an unsupervised loop would ship, and what catching it in the loop is worth.

When a coding agent finishes a loop, dxkit reruns a deterministic gate and blocks the stop if the change introduced a net-new finding, so the finding is repaired in the warm loop. In our benchmark, unsupervised loops left net-new debt in 69% of runs (11/16); with the dxkit gate, 0 of 16 escaped (observed). Fixing a finding in the warm loop took about 34% fewer agent turns than deferring it to a later cold session. On large repos, structural context also bounds the worst runs. This page applies those measured rates to your cadence. The only number you provide is cadence.

Study I: Loop safety → · Study II: Cost of deferral → · Study V: Graph context → · All benchmarks →

Agent runs / week: One run = a coding-agent loop that ends in a stop (total across your repos).
Workload: Changes the repair premium only.
Repo size (LOC): Sizes the gate-latency estimate and whether the structural-context wins apply (they appear on large repos).

Default: 10 mixed agent runs/week on a ~574k-LOC monorepo (the size we measured the structural-context wins on). Rates from our 16-task benchmark (JS/TS, Claude Sonnet 4.6). On smaller repos the structural-context tiles read "minimal at this size."

What's a turn? (the unit these results use)

One step of the agent loop: the model looks at the current state, makes a single move (reads a file, edits code, runs a test), and sees the result. A task is many turns chained together. More turns means more tokens, more wall-clock time, and more cost, all moving together, which is why we report turns: it is the unit the loop counts directly and it stays stable even when token prices change.

In tokens: a turn's token cost depends on how much context the task carries, so there is no single number. For a sense of scale, the small fixes in our benchmark ran a few thousand tokens per turn, while a repair on a real repo (OWASP NodeGoat) ran ~13 to 17 turns and on the order of 1M+ tokens total, because the agent re-sends its growing context on every turn.
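To see why re-sending a growing context makes total tokens climb faster than the turn count, here is a minimal sketch. The function name and every number in it are hypothetical and chosen only for illustration; none of them are dxkit measurements.

```typescript
// Hypothetical model: each turn re-sends the starting context plus
// everything the earlier turns added. Not a dxkit measurement.
function totalTokens(
  turns: number,
  baseContext: number,
  addedPerTurn: number,
): number {
  let total = 0;
  for (let turn = 0; turn < turns; turn++) {
    // Context sent on this turn = base + what previous turns accumulated.
    total += baseContext + turn * addedPerTurn;
  }
  return total;
}

// Tripling the turn count more than triples the tokens sent.
const shortSession = totalTokens(5, 20000, 5000);
const longSession = totalTokens(15, 20000, 5000);
console.log(shortSession, longSession, longSession / shortSession);
```

Under these assumed inputs the longer session sends 5.5 times the tokens for 3 times the turns, which is the shape of growth the paragraph above describes.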

Debt-shipping runs prevented

~7 / week → 0

~7 of your 10 weekly runs (about 364 a year) would reach your branch with net-new debt unsupervised. In our benchmark that was 11 of 16 runs (a prompt-only self-check still left it in 9/16); with the dxkit gate, 0 of 16 escaped (observed).

Cheaper to fix in the loop

~34% fewer turns

Fixing a finding in the warm loop that produced it took ~34% fewer agent turns (what's a turn?) than deferring it to a fresh cold session, which has to re-read and re-understand the code first (test-gap; corroborated on a real repo, NodeGoat).

Rework avoided, in turns

≈ 2,599 turns / year

Avoidable rework at our synthetic repair scale. A turn is one step of the agent loop (what's a turn?). Bigger repos cost more turns per fix, so treat this as magnitude, not a prediction for your codebase.

Worst-case run

~57% fewer tokens

On a ~574k-LOC monorepo the worst agent runs used about 57% fewer tokens: the naive agent rabbit-holes and structural context caps it. Your repo is in the size range where we measured this.

More predictable runs

Swings roughly halved

Run-to-run cost variability dropped about 50% on the large repo, so fewer runaway sessions. Measured at ~574k LOC.

Gate cost

a few seconds per stop (~11s on a ~574k-LOC repo)

Per stop. The scan is scoped to the changed files, and its verdict is identical to a full scan's.

What this does not capture: a passing test can still admit a subtly-wrong fix. In our reward-hacking study agents did not game visible tests (0/36), but a single test under-specifies intent. dxkit gates net-new debt, not subtle incorrectness.

⚠ First-order estimates from early research. 16 seeded tasks · 2 task types · JS/TS · Claude Sonnet 4.6; structural-context numbers from one ~574k-LOC repo. A floor: it counts the cost of fixing later vs in-loop, not the cost of debt that ships and is never caught. The deferral premium is robust for feature/mixed and weak for config/secrets (median ≈ 0); we lead on the robust case. Worst-case and variability are repo-size-gated (Amdahl explains the size dependence but is not numerically fit). Cadence is your input; every rate is measured. This page estimates; it does not scan your code. How we compute this.

Gate your own loop in one command.

Installs dxkit and registers the Claude Code Stop hook additively. Existing findings are grandfathered; only net-new regressions block.
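The grandfathering rule can be pictured as a set difference between a baseline and the findings after the change. The sketch below is an illustration under stated assumptions: the `Finding` shape, the `fingerprint` field, and the `gate` function are hypothetical names, and dxkit's actual finding format and hook wiring are not shown on this page.

```typescript
// Hypothetical shapes for illustration; not dxkit's real data model.
interface Finding {
  rule: string;        // e.g. "missing-test" or "hardcoded-secret"
  file: string;
  fingerprint: string; // assumed stable identity for a finding
}

interface GateVerdict {
  block: boolean;
  netNew: Finding[];
}

// Block the stop only when the change introduced findings
// that were not already present in the baseline.
function gate(baseline: Finding[], current: Finding[]): GateVerdict {
  const known = new Set(baseline.map((f) => f.fingerprint));
  const netNew = current.filter((f) => !known.has(f.fingerprint));
  return { block: netNew.length > 0, netNew };
}

const existing: Finding = { rule: "missing-test", file: "a.ts", fingerprint: "a1" };
const introduced: Finding = { rule: "hardcoded-secret", file: "b.ts", fingerprint: "b2" };

const unchanged = gate([existing], [existing]);           // grandfathered only
const regressed = gate([existing], [existing, introduced]); // one net-new finding
console.log(unchanged.block, regressed.block);
```

Keying on a stable fingerprint rather than a raw count is one way to make sure that fixing one old finding while adding a new one still blocks.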

Get started with dxkit · View on GitHub

Methodology: How we compute this, and what it does not claim.

What you provide vs what we measured. Cadence (runs per week) is the only speculative input. Workload and repo size only select which measured constant applies; they do not invent numbers. Every rate is measured, and the page estimates from cadence times those rates. It does not scan your code.

Plain-language glossary

Agent turn: one step of the agent loop (read state, make a single move, see the result). A task is many turns. Turns track tokens, time, and cost together; we report turns because the loop counts them directly and they stay stable when token prices change. Tokens per turn vary with context size: small fixes ran a few thousand tokens per turn; a real-repo repair ran ~13 to 17 turns and on the order of 1M+ tokens total.
Net-new debt (debt-shipping run): a run that stopped and declared done with a new, detector-backed finding present (most often new code without a test; sometimes a secret or vulnerability).
Deferral premium: the extra agent turns to fix the same finding in a later cold session vs the warm loop that produced it.
Worst-case / variability: the upper tail and spread of session token use on large repos (the graph-context effect).

The constants

Quantity	Value	Sample	Source
Unsupervised run left net-new debt	69% (11/16)	Study I, n=16	docs →
Prompt-only self-check still escaped	56% (9/16)	Study I, n=16	docs →
dxkit Stop-gate escaped (observed)	0 of 16	Study I, n=16	docs →
Deferral premium, test-gap (robust)	+51% turns	Study II, 8 reps	docs →
Deferral premium, config/secrets (weak)	+44% turns (weak)	Study II, median ≈ 0	docs →
Worst-case session tokens, large repo	57% lower	Study V, ~574k LOC	docs →
Run-to-run variability, large repo	~50% lower	Study V, 30 sessions	docs →
Mean session tokens, large repo	30% lower	Study V, ~574k LOC	docs →
NodeGoat real-repo backtest	+31% turns	real repo, n=1	docs →
Reward-hacking tamper rate	0 of 36	Study C	docs →
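The table reports the test-gap deferral premium as +51% turns, while the tiles above quote ~34% fewer turns. Assuming both describe the same measurement from opposite baselines (cold relative to warm versus warm relative to cold), the conversion is plain arithmetic. The function names below are hypothetical.

```typescript
// If a cold fix costs (1 + premium) times the warm fix, the warm fix
// takes 1 - 1 / (1 + premium) fewer turns than the cold one.
function fewerTurnsFromPremium(premium: number): number {
  return 1 - 1 / (1 + premium);
}

// Inverse: recover the premium from a "fewer turns" fraction.
function premiumFromFewerTurns(fewer: number): number {
  return 1 / (1 - fewer) - 1;
}

const testGapFewer = fewerTurnsFromPremium(0.51);
console.log(Math.round(testGapFewer * 100)); // 34
```

The two figures are one effect stated two ways, so they should not be added together.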

The formulas

escapingRunsPerWeek = round(runsPerWeek * 0.69)
escapingRunsPerYear = escapingRunsPerWeek * 52
premiumTurnsPct     = premium[workload].turnsPct
isWeak              = premium[workload].weak   // suppress the ratio when true
illustrativeTurnsSavedPerYear =
  round(escapingRunsPerYear * illustrativeWarmTurns[workload] * premiumTurnsPct)
latencyBand  = locLatencyBand(loc)
contextBand  = locContextBand(loc)   // 'limited' | 'growing' | 'measured-range'
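As a worked instance of the formulas above, the sketch below reproduces the default tiles (~7 runs/week, 364/year, 2,599 turns/year). The 0.69 and 0.51 constants come from the table; the warm-turn count of 14 is an assumption, picked because it reproduces the 2,599 figure and falls inside the ~13 to 17 turn range quoted for NodeGoat. The `Premium`, `Estimate`, and `estimate` names are hypothetical, and `turnsPct` is treated as a fraction.

```typescript
interface Premium {
  turnsPct: number; // fraction, e.g. 0.51 for +51% turns
  weak: boolean;    // when true, the ratio tile is suppressed
}

interface Estimate {
  escapingRunsPerWeek: number;
  escapingRunsPerYear: number;
  illustrativeTurnsSavedPerYear: number;
  showRatio: boolean;
}

const ESCAPE_RATE = 0.69; // Study I: 11 of 16 unsupervised runs

function estimate(
  runsPerWeek: number,
  premium: Premium,
  warmTurns: number, // ASSUMPTION: illustrative warm-loop turns per fix
): Estimate {
  const escapingRunsPerWeek = Math.round(runsPerWeek * ESCAPE_RATE);
  const escapingRunsPerYear = escapingRunsPerWeek * 52;
  const illustrativeTurnsSavedPerYear = Math.round(
    escapingRunsPerYear * warmTurns * premium.turnsPct,
  );
  return {
    escapingRunsPerWeek,
    escapingRunsPerYear,
    illustrativeTurnsSavedPerYear,
    showRatio: !premium.weak,
  };
}

// Page defaults: 10 runs/week, robust (test-gap) premium.
const defaults = estimate(10, { turnsPct: 0.51, weak: false }, 14);
console.log(defaults);
```

Note that the weekly figure is rounded before it is annualized, so the yearly count is always a multiple of 52.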

The Amdahl model, and why it is not a knob. Whole-session savings are bounded by savings ≈ f·(1 − 1/s) − O/T: even an infinite per-operation speedup caps savings at the orientation fraction f, and a fixed overhead O/T dominates on small repos (a forced-graph probe cost ~66% more on the small app). That is why the worst-case and variability wins appear on the large repo and vanish on small ones, so we use repo size to gate those tiles qualitatively. The model is directional and not numerically fit: we never compute or scale a per-LOC figure, and only the two measured anchors (~2k LOC gives roughly none, ~574k LOC gives the 57% worst-case cut) are ever shown.
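The bound can be evaluated directly to see both regimes. This is a directional sketch only: the function name and all input values are hypothetical, since the page states that the model is not numerically fit.

```typescript
// savings ≈ f * (1 - 1/s) - O/T
//   f: orientation fraction of the session that the graph can speed up
//   s: per-operation speedup on that fraction
//   overheadFraction: fixed overhead O as a share of session time T
function sessionSavings(f: number, s: number, overheadFraction: number): number {
  return f * (1 - 1 / s) - overheadFraction;
}

// Hypothetical large repo: orientation dominates, overhead is small.
// Even with an infinite speedup, savings cannot exceed f.
const largeRepoCeiling = sessionSavings(0.6, Infinity, 0.05);

// Hypothetical small repo: little orientation to save, so the fixed
// overhead pushes the result negative (the graph costs more than it saves).
const smallRepo = sessionSavings(0.1, 4, 0.3);

console.log(largeRepoCeiling, smallRepo);
```

A negative result corresponds to the small-app case described above, where forcing the graph probe made the session more expensive.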

Is this validated? The headline multiplies two independent experiments (how often loops leave debt, times how much more a deferred fix costs), so the product is a real prediction, not a restatement. The pattern holds even on the secret task, where escapes happen naturally (3 of 8). The relative deferral premium also holds on a real repository (NodeGoat: +31% turns), which is why we report the ratio and treat absolute turns as illustrative.

What would make this better. Larger n, more task types, languages, and agent families; the structural-context numbers reproduced beyond one repo; a numerical Amdahl fit. Limits today: 16 seeded tasks, 2 task types, JS/TS, one agent family, n=1 on the real repo.

Measure on your repo. The web number uses benchmark rates. For findings on your actual code, run dxkit locally; it is offline and nothing leaves your machine.

Claim ledger and retractions → · Amdahl model → · Graph context →