dxkit

Context to make the change. A gate to stop cleanly.

dxkit gives AI coding agents a repo code graph for structural context and a deterministic stop gate that blocks net-new regressions before the agent declares the task done.

In our loop-safety benchmark, vanilla loops shipped net-new debt in 11/16 runs, prompt-only self-review still let net-new debt through in 9/16, and dxkit had 0/16 observed escapes.

Local · Offline · Deterministic · No model in the gate · Existing debt grandfathered

The failure mode

Agents can pass tests and still leave the repo worse.

An autonomous coding loop keeps editing until the agent decides it is done. Tests and linters can catch broken code, but they do not answer a different question: did this change introduce something new and worse than the baseline?

That is how a loop can add a feature, run green tests, and still leave behind a new secret, a critical dependency regression, an untested path, or another detector-backed finding. The agent reports success because the usual stop condition is under-specified.

Prompting the agent to “be careful” is not enough. The same model that wrote the change is grading its own work with the same blind spots.

Demo

See the gate block and repair in seconds.

Run the local demo from any directory:

dxkit demo: loop-guardrail

How it works

Baseline, change, gate, repair.

Build structural context

dxkit scaffolds the repo and builds a code graph so agents can orient around files, symbols, dependencies, callers, callees, and blast radius.

Baseline today’s findings

dxkit records the current state of the repo. Existing debt is grandfathered, so brownfield repos are usable from day one.

Run trusted checks

dxkit runs or ingests established scanners and checks covering secrets, dependency vulnerabilities, code patterns, test gaps, duplication, and more, including findings imported as SARIF.

Block only net-new regressions

When the agent tries to stop, dxkit compares the final tree against the baseline. Known findings do not block. New findings introduced by this change do.

Repair in-loop

The gate returns a concrete reason the agent can act on. The agent fixes the issue while the task context is still warm and tries to stop again.
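The baseline, compare, and repair steps above can be sketched as a set comparison. This is a minimal illustration, not dxkit's implementation: the `Finding` fields, the `gate` function, and the fingerprint strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    fingerprint: str  # stable identity for the finding (hypothetical field)
    detector: str     # tool that reported it, e.g. "gitleaks"
    message: str      # human-readable description used as the repair reason


def gate(baseline: set[str], current: list[Finding]) -> tuple[str, list[str]]:
    """Return ("BLOCK", reasons) if any current finding is missing from the baseline."""
    net_new = [f for f in current if f.fingerprint not in baseline]
    # Sort so the same inputs always produce the same output.
    net_new.sort(key=lambda f: f.fingerprint)
    reasons = [f"{f.detector}: {f.message}" for f in net_new]
    return ("BLOCK" if net_new else "CLEAN", reasons)


# Baseline recorded before the agent started: existing debt is grandfathered.
baseline = {"fp-old-1", "fp-old-2"}

# Findings on the final tree when the agent tries to stop.
after_change = [
    Finding("fp-old-1", "semgrep", "existing pattern match"),
    Finding("fp-new-9", "gitleaks", "hardcoded API key in config.py"),
]

verdict, reasons = gate(baseline, after_change)
print(verdict, reasons)
```

The known finding does not block; the new one does, and its message doubles as the reason handed back to the agent. No model call appears anywhere in the decision.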

Scope

What dxkit is, and is not.

A deterministic verification layer

It baselines today's findings, fingerprints them so each keeps a stable identity as the code churns, and blocks only net-new regressions.

Not a scanner replacement

It runs scanners such as gitleaks and Semgrep, ingests SARIF output from tools such as CodeQL and Snyk, and makes their findings enforceable. It does not claim to find more bugs than they do.

Not an LLM judge

No model decides whether the gate passes. The model can repair findings. The gate itself is deterministic, and the prompt does not grow as the baseline grows.

Not a guarantee of safe code

It blocks detector-backed net-new findings it can observe. You still need tests, review, scanners, and judgment.
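To make the fingerprinting idea under "A deterministic verification layer" concrete, here is one possible scheme. It is an assumption for illustration only; this page does not describe how dxkit actually computes finding identity. The sketch hashes the rule, the file path, and a whitespace-normalized snippet, and deliberately leaves out the line number so that moving or re-indenting code does not turn an old finding into a "new" one.

```python
import hashlib


def fingerprint(rule_id: str, path: str, snippet: str) -> str:
    """Hash rule, file, and normalized snippet; line numbers are excluded on purpose."""
    normalized = " ".join(snippet.split())  # collapse runs of whitespace
    payload = "\x00".join([rule_id, path, normalized])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# The original finding.
before = fingerprint("hardcoded-secret", "app/config.py", 'API_KEY = "abc123"')

# The same finding after the line moved and was re-indented.
after = fingerprint("hardcoded-secret", "app/config.py", '    API_KEY   =   "abc123"')

# A genuinely different finding in the same file.
other = fingerprint("hardcoded-secret", "app/config.py", 'DB_PASS = "hunter2"')

print(before == after, before == other)
```

A scheme like this trades precision for stability: renaming the file or editing the flagged line would still change the fingerprint.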

Try it on your repo

Install the gate, baseline today's debt, and go.

One command installs dxkit and registers the Claude Code Stop hook additively, so your existing settings are preserved. Existing findings are grandfathered; only net-new regressions block.

The default preset is security-only. Expand to broader debt checks when you are ready.

Preset	What blocks	Best for
`security-only`	Secrets and critical/high vulnerabilities	First install, launch gate, production repos
`full-debt`	Security plus test gaps and maintainability regressions	Agent-loop experiments, stricter teams, research runs

Research

Prompting helped a little. A deterministic gate changed the stop condition.

In the loop-safety benchmark, the task was not to catch every possible bug. It was narrower: when a coding loop introduces a detector-backed finding, does the loop stop dirty or repair before stopping?

Loop	Stopped with net-new debt
Vanilla loop	11/16
Prompt-only self-check	9/16
dxkit deterministic stop gate	0/16 observed

The takeaway is not that dxkit makes code universally safe. The takeaway is that “tests pass, looks done” is not a sufficient stop condition for autonomous coding loops. A deterministic net-new gate gives the loop a repeatable external check.
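For readers who want the arithmetic behind the table, the sketch below converts the counts to rates. It also computes a one-sided 95% upper bound for the zero-escape row, under the assumption (ours, not the benchmark's) that the 16 runs are independent with a common escape probability. With 0 escapes in 16 runs, that bound is about 17%, which is why the result is reported as "observed" rather than as proof.

```python
from fractions import Fraction

RUNS = 16
escapes = {"vanilla": 11, "prompt-only": 9, "dxkit gate": 0}

# Exact escape rate per loop type.
rates = {name: Fraction(count, RUNS) for name, count in escapes.items()}
for name, rate in rates.items():
    print(f"{name}: {float(rate):.2%}")

# Zero events in n independent trials: the largest rate p still consistent
# with that outcome at the 5% level satisfies (1 - p) ** n = 0.05.
upper_95 = 1 - 0.05 ** (1 / RUNS)
print(f"95% upper bound for 0/{RUNS}: {upper_95:.1%}")
```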

Essay

Your AI coding agent ships debt 69% of the time when you're not looking

Launch essay on autonomous loops, prompt-only self-review, and deterministic stop gates.

Reference

Benchmarks

Full methodology, reproducibility notes, threats to validity, and exact caveats.

Evaluate

What is this worth on your loop?

Multiply your agent cadence by our measured benchmark rates to estimate the net-new debt an unsupervised loop would ship, and what catching it in the loop is worth. Cadence is the only number you provide.
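The multiplication described above is simple enough to sketch. The cadence value is a made-up example, and the rates come from seeded benchmark tasks, so treat the output as a rough planning number rather than a prediction for your repo.

```python
# Benchmark escape rates from the table in the Research section.
VANILLA_RATE = 11 / 16
PROMPT_ONLY_RATE = 9 / 16


def estimated_escapes(runs_per_week: float, escape_rate: float) -> float:
    """Expected number of runs per week that stop with net-new debt."""
    return runs_per_week * escape_rate


cadence = 40  # hypothetical: unsupervised agent runs per week

print(estimated_escapes(cadence, VANILLA_RATE))
print(estimated_escapes(cadence, PROMPT_ONLY_RATE))
```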

FAQ

Frequently asked questions

What problem does dxkit solve?

dxkit gives AI coding agents a deterministic stop condition. An autonomous loop keeps editing until the agent decides it is done. Existing checks can tell you whether code is broken, but they do not always answer the loop-level question: did this change introduce detector-backed debt relative to the baseline, and may the agent stop? dxkit baselines existing findings, reruns trusted checks when the agent tries to stop, and blocks only net-new findings with a concrete repair reason.

Is dxkit a scanner?

No. dxkit runs and ingests scanners, but it is not trying to out-detect them. It uses tools such as gitleaks, Semgrep, OSV, and npm audit, and external SARIF sources such as Snyk Code and CodeQL. The layer dxkit adds is the baseline, finding identity, net-new comparison, Stop-gate verdict, loop ledger, and code graph context. dxkit does not replace scanners. It makes scanner findings enforceable inside an agent loop.
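As an illustration of what ingesting an external SARIF source involves, the sketch below flattens a SARIF 2.1.0 log into simple finding records. The field paths (`runs`, `results`, `ruleId`, `physicalLocation`) follow the SARIF standard; the output record shape and the function name are assumptions for this example, not dxkit's internal format.

```python
import json


def findings_from_sarif(text: str) -> list[dict]:
    """Flatten a SARIF 2.1.0 log into one record per result."""
    doc = json.loads(text)
    findings = []
    for run in doc.get("runs", []):
        tool = run.get("tool", {}).get("driver", {}).get("name", "unknown")
        for result in run.get("results", []):
            # Use the first reported location, if any.
            locations = result.get("locations") or [{}]
            phys = locations[0].get("physicalLocation", {})
            findings.append({
                "tool": tool,
                "rule_id": result.get("ruleId"),
                "level": result.get("level", "warning"),
                "path": phys.get("artifactLocation", {}).get("uri"),
                "line": phys.get("region", {}).get("startLine"),
                "message": result.get("message", {}).get("text", ""),
            })
    return findings


sample = json.dumps({
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "ExampleScanner"}},
        "results": [{
            "ruleId": "py/sql-injection",
            "level": "error",
            "message": {"text": "Unsanitized input reaches a SQL query."},
            "locations": [{"physicalLocation": {
                "artifactLocation": {"uri": "app/db.py"},
                "region": {"startLine": 42},
            }}],
        }],
    }],
})

parsed = findings_from_sarif(sample)
print(parsed)
```

Once findings from different tools share one record shape, the baseline and net-new comparison can treat them uniformly.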

How is dxkit different from CI?

CI usually runs after the agent has produced a change. dxkit runs inside the agent loop, at the stop boundary. If the agent introduced a finding, dxkit can block while the agent still has task context, and it returns a concrete repair reason so the agent fixes only what it introduced. CI is still useful; dxkit gives the agent an earlier, local, baseline-relative stop decision.

How is dxkit different from Semgrep, Snyk, SonarQube, or CodeQL?

Use those tools. dxkit can ingest their findings. The difference is not detection quality; it is architecture and tempo. Scanners and quality gates detect and report issues. CI and PR checks gate merges or builds. dxkit packages detector output into a local, per-stop, agent-facing decision: BLOCK or CLEAN, with a repair reason the agent can act on.

Why not just use Claude Code hooks directly?

Claude Code hooks are the mechanism. dxkit is the packaged gate. A Stop hook gives you a place to block an agent from exiting; dxkit adds the parts you would otherwise build yourself: baseline creation, detector orchestration, finding fingerprinting, net-new comparison, Stop hook wiring, repair feedback, the loop ledger, and code graph context. Hooks let you intercept the stop event. dxkit defines and enforces the stop condition.
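To show the part a bare hook leaves to you, here is a sketch of the last step only: turning a list of net-new findings into a Stop hook response. It assumes the JSON output shape described in Claude Code's hooks documentation, where a Stop hook that prints `{"decision": "block", "reason": ...}` keeps the agent working; check the current documentation before relying on it. The function name and message wording are hypothetical, and everything upstream (baseline, detectors, fingerprinting, comparison) is what dxkit packages.

```python
import json


def stop_hook_output(net_new_reasons: list[str]) -> str:
    """Build the text a Stop hook would print to stdout.

    An empty string means "no objection": the agent is allowed to stop.
    """
    if not net_new_reasons:
        return ""
    reason = "Net-new findings must be fixed before stopping:\n" + "\n".join(
        f"- {r}" for r in net_new_reasons
    )
    return json.dumps({"decision": "block", "reason": reason})


blocked = stop_hook_output(["gitleaks: hardcoded API key in config.py"])
allowed = stop_hook_output([])
print(blocked)
```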

Does dxkit use an LLM to decide whether the gate passes?

No. The model may repair findings, but the gate verdict is not produced by an LLM. dxkit stores baseline state outside the model context and compares detector-backed findings against that baseline. The same repo state, baseline, and detector output produce the same verdict, with no cost, prompt growth, or non-determinism from a model call.

Does dxkit guarantee safe code?

No. dxkit blocks detector-backed net-new findings that its configured detectors observe. It does not prove program correctness, catch every vulnerability, or replace tests, review, CI, or human judgment. The benchmark result is 0/16 observed escapes on seeded detector-backed tasks, not proof that no escape is possible.

What blocks by default?

The default preset is security-only: secrets and critical or high vulnerabilities. It is intentionally narrow because a default gate should block only findings that are bounded, must-fix, and cheap to check. The broader full-debt preset is opt-in because repairs such as writing tests can be expensive. The headline benchmark used full-debt; the product default is security-only.

Why is existing debt grandfathered?

Because most real repositories are brownfield. If a repo already has hundreds or thousands of findings, a zero-findings gate is unusable. dxkit captures today's state as the baseline and blocks only new regressions. Known debt is allowed. New debt blocks. This makes the gate usable on day one without a total cleanup before adoption.

Does dxkit catch subtle correctness bugs?

Not by itself. dxkit catches detector-backed net-new findings: secrets, high-severity vulnerabilities, test gaps, maintainability findings, and external SARIF findings, depending on configuration. It does not prove semantic correctness. Correctness still needs specifications: tests, review, held-out cases, and blast-radius context.

What is the code graph for?

The code graph gives the agent structural context while it works: callers, callees, blast radius, relevant files, existing patterns, and where a fix or feature should attach. The graph helps the agent make the change; the gate decides whether it may stop. The measured benefit is predictability, not guaranteed lower cost: on a large repo the worst case fell sharply and variance tightened, while on a small app the mean cost was essentially unchanged.
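One way to picture "blast radius" is as the set of transitive callers of a changed symbol. The sketch below computes that with a breadth-first search over a toy call graph. The graph, the symbol names, and the function are illustrative assumptions; how dxkit builds and queries its own graph is not described here.

```python
from collections import deque


def blast_radius(calls: dict[str, set[str]], changed: str) -> set[str]:
    """Return every symbol that directly or transitively calls `changed`."""
    # Invert the caller -> callees map into callee -> callers.
    callers: dict[str, set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    affected: set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller != changed and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


# Toy call graph: each key calls the symbols in its set.
calls = {
    "api.handler": {"svc.create_user"},
    "cli.main": {"svc.create_user"},
    "svc.create_user": {"db.insert", "util.hash"},
    "util.hash": set(),
}

print(sorted(blast_radius(calls, "db.insert")))
```

Changing `db.insert` reaches both entry points through `svc.create_user`, while changing `api.handler` affects nothing upstream.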

Does dxkit require Claude Code?

Claude Code is the first-class loop integration, but the core gate is agent-agnostic. The deterministic core (baseline, scan, fingerprint, net-new comparison, block or allow verdict) can run in CI, pre-push, a custom loop, or another agent workflow. Claude Code is the first high-leverage integration because its Stop hook gives dxkit a natural place to enforce the stop decision.
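As a sketch of what "a custom loop" could look like, the function below wraps any agent step in a gate check and feeds net-new findings back as repair input. The callables, the attempt limit, and the fake agent are all hypothetical stand-ins; a real integration would call dxkit rather than compare sets by hand.

```python
from typing import Callable


def run_loop(
    agent_step: Callable[[list[str]], None],
    scan: Callable[[], set[str]],
    baseline: set[str],
    max_attempts: int = 3,
) -> bool:
    """Run the agent until the tree has no net-new findings or attempts run out."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        agent_step(feedback)          # agent edits the repo, given prior findings
        net_new = scan() - baseline   # deterministic comparison, no model involved
        if not net_new:
            return True               # CLEAN: the loop may stop
        feedback = sorted(net_new)    # BLOCK: hand back concrete repair targets
    return False


# Fake repo state and agent, just to exercise the loop.
repo = {"fp-legacy"}
attempts: list[list[str]] = []


def fake_agent(feedback: list[str]) -> None:
    attempts.append(list(feedback))
    if feedback:
        repo.difference_update(feedback)  # repair what the gate reported
    else:
        repo.add("fp-secret")             # first pass introduces a new finding


clean = run_loop(fake_agent, lambda: set(repo), baseline={"fp-legacy"})
print(clean, attempts)
```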

Is the fixture demo real?

Yes. The fixture demo creates a temporary repo, creates a baseline, introduces a detector-backed finding, runs the real gate, blocks, repairs the fixture, reruns the gate, and tears the fixture down. If the required detector is missing, dxkit prints a clearly labeled illustration instead.

Give your coding agent a stop condition outside the model.

Run dxkit locally, watch the stop gate block a net-new finding, and wire it into your own agent loop.
