Proof-Carrying Operations: From Digital Twins to Trusted Action
By Sid, Founder at Vyuh
A digital twin is often described as a synchronized representation of a physical system. That is true, but incomplete. A representation becomes operationally valuable only when it can support decisions safely.
As AI agents move into industrial environments, the interesting question is not whether an agent can retrieve plant data or summarize a procedure. The interesting question is whether an AI-assisted recommendation can be trusted enough to become an operational action: to evaluate whether a work order can proceed, flag a clearance violation, prepare a permit package, or support the human decision to energize a circuit.
In a high-stakes environment, the answer cannot be "because the model said so."
A useful industrial action should carry proof. It should carry the asset it applies to, the source facts it depends on, the standards that govern it, the constraints that were evaluated, the uncertainty that remains, the person or policy that approved it, and the audit record that lets the decision be reconstructed later.
This is the shift from digital twin to trusted action. More precisely, it is the shift toward proof-carrying operations.
A note on the word proof, because it is doing specific work here. This is not a claim that an industrial action can be proven safe in every possible world, in the formal-methods sense. Industrial systems are full of uncertainty: stale sensors, incomplete models, conflicting standards, changing conditions. The claim is narrower and more practical. It borrows from proof-carrying code, where a program ships with a machine-checkable certificate of a safety property. A proof-carrying action ships with a domain-specific evidence packet that an independent verifier can check: resolved identity, attributed facts, applicable rules, freshness, residual uncertainty, approval state, and an audit record. The proof is scoped to one proposed action, one system state, and one context window. The verifier may still return unknown, and often that is the right answer.
1. Representation is not assurance
A digital twin is frequently treated as the endpoint of industrial digitization. Connect the systems, synchronize the state, render the model, and the work is considered done.
It is not done. A twin is a representation. It tells you what is true. It does not, on its own, tell you what is safe to do next, or prove that a proposed action is allowed.
Layering a language model on top does not close that gap. A capable agent can read the twin, retrieve documents, and produce fluent recommendations. But fluency is not assurance. The agent's output is sampled from a probability distribution over plausible text. In a control room, on an energized line, or beside a vessel holding hazardous material, plausible is not the bar.
The bar is this: before a recommendation becomes an action, it should cross an assurance boundary, and it should not be allowed to cross unless it carries verifiable proof.
2. A plant is a system of systems
Start with a concrete situation, because the abstraction only earns trust once it survives contact with a real plant.
Consider one reactor, R-501, and a request to begin hot work on it. The facts that decide whether that work can proceed do not live in one place. They are scattered across systems that were never designed to agree:
| System | What it reports |
|---|---|
| Engineering model | R-501 is a pressure vessel |
| Maintenance system | Hot-work permit has expired |
| Historian | Still holding 18 psig |
| Live control | Running at 268°F |
| ERP / inventory | 14,500 lb ethylene oxide on the unit |

Each of these systems is correct. The engineering model is right about what R-501 is. The maintenance system is right about the permit. The historian is right about pressure. None of them is wrong.
The danger is that no one system is complete. No single screen shows all five facts at once, and the hazard lives precisely in their intersection: hot work, on a pressurized vessel, above ambient temperature, full of a reactive material, with an expired permit. Viewed alone on its own screen, each fact can look routine. Together they describe an action that must not be approved automatically.
No one system is wrong. The danger is that no one system is complete.
An agent that reasons over natural-language summaries of these systems can easily produce a confident, wrong answer, because the join across systems is exactly where ambiguity, staleness, and identity mismatch accumulate. The architecture has to take that join seriously, as a first-class, governed step, rather than leave it to a prompt.
3. The assurance boundary
Define the central abstraction:
The assurance boundary is the layer that decides whether an AI-generated recommendation is allowed to become an operational action.
The boundary receives a proposed capability invocation from an agent (for example, work_order.evaluate_hot_work_readiness on asset:R-501) and admits it only when the required evidence and checks are satisfied. Note the capability name: the agent asks the system to evaluate readiness and route the result, not to approve hot work itself. The boundary's output is not free text. It is one of three states: pass, fail, or unknown.
The three-state result is the heart of it. A binary pass/fail verifier is tempting and wrong, because the most dangerous real-world condition is not "violation," it is "we do not actually know." A two-state system is under pressure to resolve every case into pass or fail, and the easy direction to resolve toward is pass. A three-state system makes uncertainty a first-class, visible outcome that routes to a person.
The agent can be probabilistic. The admission control for high-stakes action should not be.
A small formal model
It helps to state this precisely, without making it unreadable. Let:
S_t = observed system state at time t
G_t = governed operating graph at time t
R_t = applicable rule set at time t
C = capability the agent wants to invoke
E = evidence packet assembled for the proposed action
V = deterministic verifier

By deterministic we mean deterministic in procedure, not certainty in the evidence. The packet E can carry uncertainty: confidence scores, freshness thresholds, simulation outputs, uncertainty bands. The verifier is free to evaluate all of it. What must hold is reproducibility: given the same capability, evidence packet, graph state, rule set, and policy version, V always returns the same pass, fail, or unknown. The judgment is auditable because the procedure is fixed, even when the world it judges is not.
An action is admissible only if:
V(C, E, G_t, R_t) -> pass

If the verifier returns fail, the action is blocked. If it returns unknown, the action escalates to a human. The single rule that keeps the whole system honest is:
In high-stakes operations, unknown must never be silently converted into pass.
This is the property that no amount of model quality can substitute for. A better model narrows the unknown band. It does not earn the right to collapse it.
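The admission rule can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the names Verdict, verify, and admit are hypothetical, and the evidence is reduced to named checks that are True (satisfied), False (violated), or None (could not be decided). Letting a known violation take precedence over an undecided check is also an assumption of this sketch.

```python
from enum import Enum
from typing import Mapping, Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"


def verify(checks: Mapping[str, Optional[bool]]) -> Verdict:
    """Hypothetical stand-in for V: a pure function of its inputs.

    Same checks in, same verdict out. No randomness, no hidden state.
    """
    results = list(checks.values())
    if any(r is False for r in results):
        return Verdict.FAIL        # a known violation blocks outright
    if not results or any(r is None for r in results):
        return Verdict.UNKNOWN     # missing or undecidable evidence is never a pass
    return Verdict.PASS


def admit(verdict: Verdict) -> str:
    """Route a verdict. Only an explicit PASS admits the action."""
    if verdict is Verdict.PASS:
        return "admit"
    if verdict is Verdict.FAIL:
        return "block"
    return "escalate_to_human"     # UNKNOWN has no code path that leads to "admit"
```

The design choice worth noticing is structural: there is no default branch that resolves toward pass, so an unknown can only leave the function as an escalation.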
4. Proof-carrying actions
The boundary does not hand agents raw access to operational systems. It exposes typed capabilities, and a capability is a contract, not an API endpoint: what is being requested, what evidence is required, what preconditions must hold, what effects are permitted, what effects are explicitly forbidden, and how it fails safely. For R-501, in compact form:
capability: work_order.evaluate_hot_work_readiness
required_evidence: [asset_identity, work_order_scope, permit_status,
                    operating_pressure, operating_temperature,
                    material_inventory, applicable_rules, source_freshness]
preconditions:
  - identity_confidence >= configured_threshold
  - mandatory_sources_fresh == true
  - no_unresolved_rule_conflicts == true
effects: [emit readiness_decision (pass | fail | unknown), block, escalate]
non_effects: [does_not_execute_physical_work, does_not_override_permit_state,
              does_not_grant_final_human_authorization]
safe_failure:
  - missing_mandatory_evidence -> unknown
  - stale_mandatory_source -> fail_or_unknown
  - unknown -> human_review

The non_effects block is the point. The agent can request an evaluation and the verifier can emit a readiness result, but the agent never holds raw authority over the physical system. That is what separates an operating layer from a chatbot with tools.
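One way to read the contract is as a gate that runs before any rule is evaluated. The Python sketch below is an assumption-laden illustration: the function gate, the Preconditions class, and the 0.95 default threshold are hypothetical (the contract only says configured_threshold), and where the contract allows fail_or_unknown for a stale source, this sketch picks fail.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Tuple

# Evidence keys copied from the contract's required_evidence list.
REQUIRED_EVIDENCE: Tuple[str, ...] = (
    "asset_identity", "work_order_scope", "permit_status",
    "operating_pressure", "operating_temperature",
    "material_inventory", "applicable_rules", "source_freshness",
)


@dataclass(frozen=True)
class Preconditions:
    identity_confidence: float
    mandatory_sources_fresh: bool
    no_unresolved_rule_conflicts: bool


def gate(evidence: Mapping[str, Any], pre: Preconditions,
         identity_threshold: float = 0.95) -> str:
    """Apply the safe-failure rules before rule evaluation.

    Returns "unknown", "fail", or "proceed_to_rules". It never returns
    "pass": a pass can only come from evaluating the rules themselves.
    """
    missing = [key for key in REQUIRED_EVIDENCE if evidence.get(key) is None]
    if missing:
        return "unknown"       # missing_mandatory_evidence -> unknown
    if not pre.mandatory_sources_fresh:
        return "fail"          # stale_mandatory_source -> fail_or_unknown (fail chosen here)
    if pre.identity_confidence < identity_threshold:
        return "unknown"       # identity not resolved confidently enough
    if not pre.no_unresolved_rule_conflicts:
        return "unknown"       # conflicting rules go to a human
    return "proceed_to_rules"
```

Because the gate has no effect other than returning a string, it mirrors the non_effects idea: nothing in it can touch permit state or physical equipment.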
If the verifier is the gate, the evidence packet is the thing that has to pass through it. It is the proof the action carries.
An evidence packet records, for one proposed action: the resolved canonical identity of the asset, the source facts the decision depends on (each with its origin system and timestamp), the rules that were evaluated and how each resolved, the verifier's overall result, the required approval path, the versions of the graph, rules, and model it was evaluated against, the context window it is valid for, and a flag for whether the whole packet can be reproduced from its inputs.
Here is the packet for the R-501 request. System names are generic by design: the architecture does not care which vendor's maintenance system or historian supplied a fact, only that the fact is attributed, timestamped, and governed.
{
"decision_id": "decision:WO-44721:R-501:2026-06-01T14:03:12Z",
"request": "Can work order WO-44721 proceed on R-501?",
"capability_requested": "work_order.evaluate_hot_work_readiness",
"policy_version": "work_order_policy:2026.04",
"rule_set_version": "hot_work_rules:2026.04",
"graph_version": "operating_graph:plant7:2026-06-01T14:00:00Z",
"model_version": "plant_model:as_operated:v17",
"context_window": {
"evaluated_at": "2026-06-01T14:03:12Z",
"valid_until": "2026-06-01T14:08:12Z",
"expires_on_change": ["permit_status", "internal_pressure", "temperature",
"material_inventory", "asset_identity",
"rule_set_version", "work_order_scope", "approval_state"]
},
"canonical_asset": {
"id": "asset:R-501",
"type": "pressure_vessel",
"source_ids": {
"DEXPI": "R-501",
"maintenance": "EQ-88421",
"historian": "PRES_VESSEL_501",
"OPC_UA": "ns=2;s=Unit7.R501",
"ERP": "FL-PLANT-AREA-R501"
},
"identity_confidence": 0.98,
"identity_resolution_method": "deterministic_tag_match_with_graph_confirmation"
},
"freshness": {
"max_allowed_age_seconds": 300,
"oldest_fact_age_seconds": 312,
"result": "fail"
},
"facts": [
{ "source": "maintenance", "claim": "hot_work_permit_status", "value": "expired",
"timestamp": "2026-06-01T14:02:30Z" },
{ "source": "historian", "claim": "internal_pressure", "value": "18 psig",
"timestamp": "2026-06-01T14:02:45Z" },
{ "source": "OPC_UA", "claim": "temperature", "value": "268°F",
"timestamp": "2026-06-01T14:02:51Z" },
{ "source": "ERP", "claim": "inventory_material", "value": "14,500 lb ethylene oxide",
"timestamp": "2026-06-01T13:58:00Z" }
],
"rules_evaluated": [
{ "rule_id": "hot_work.requires_valid_permit", "rule_version": "2026.04",
"result": "fail", "evidence": ["hot_work_permit_status=expired"] },
{ "rule_id": "vessel.requires_safe_operating_state", "rule_version": "2026.04",
"result": "fail", "evidence": ["internal_pressure=18 psig", "temperature=268°F",
"material=ethylene oxide"] }
],
"verifier": { "name": "hot_work_readiness_verifier", "version": "1.3.2",
"procedure": "deterministic", "result": "fail" },
"decision": "block_and_escalate",
"approver_required": "operations_supervisor",
"audit_status": "reproducible"
}

The artifact is not just an audit log written after the fact. It is the object the action must carry before it crosses the assurance boundary.
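The freshness block in the packet can be recomputed from the fact timestamps, which is what makes the packet reproducible. A short Python check, using only the timestamps and the 300-second limit shown in the packet (the helper name parse_utc is an assumption of this sketch):

```python
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    # Packet timestamps end in "Z"; rewrite it so fromisoformat accepts it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


evaluated_at = parse_utc("2026-06-01T14:03:12Z")
fact_timestamps = {
    "hot_work_permit_status": "2026-06-01T14:02:30Z",
    "internal_pressure": "2026-06-01T14:02:45Z",
    "temperature": "2026-06-01T14:02:51Z",
    "inventory_material": "2026-06-01T13:58:00Z",
}
MAX_ALLOWED_AGE_SECONDS = 300

# Age of each fact at evaluation time, in whole seconds.
ages = {
    claim: int((evaluated_at - parse_utc(ts)).total_seconds())
    for claim, ts in fact_timestamps.items()
}
oldest_claim = max(ages, key=ages.get)
oldest_age = ages[oldest_claim]
freshness_result = "pass" if oldest_age <= MAX_ALLOWED_AGE_SECONDS else "fail"

print(oldest_claim, oldest_age, freshness_result)
# prints: inventory_material 312 fail
```

The oldest fact is the ERP inventory reading, 312 seconds old at evaluation time, which exceeds the 300-second limit and matches the freshness result recorded in the packet.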
This is also where the difference between a search index and an operating system shows up. The state space underneath the verifier is not a document store; it is an operating graph: canonical assets, locations, work orders, permits, sensor tags, material inventories, standard clauses, operating states, capabilities, decisions, evidence, approvals, model versions, and rule versions, joined by relationships such as same_as, governed_by, requires_permit, has_state, contains_material, violates, approved_by, and verified_by.
This operating graph is not assembled at question time: it is built and governed in a separate, earlier phase, where source systems are reconciled into one model, concepts are bound to cited rules, a person signs off on the result, and everything is versioned so that later changes re-enter that build rather than overwrite it silently. (How the model is built is a subject of its own, which we will take up separately.)
A knowledge graph answers questions. An operating graph constrains action.
The distinction is load-bearing. A search index retrieves documents; a knowledge graph connects concepts; an operating graph encodes admissibility: which assets are the same across systems, which source is authoritative for each claim, which standards govern each action, which capabilities are allowed, which preconditions must hold, which approvals are required, and which facts expire and when. It is not a memory layer the agent reads from. It is part of the control surface the verifier reasons over.
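The same_as relationship is the simplest piece of that graph to illustrate. The Python sketch below assumes a hypothetical in-memory table, SAME_AS, keyed by (system, local identifier) and populated with the R-501 source identifiers from the evidence packet; a governed graph would be versioned and far richer. The resolver returns a canonical asset only when every reference is known and all references agree.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical same_as edges: (source system, local id) -> canonical asset id.
SAME_AS: Dict[Tuple[str, str], str] = {
    ("DEXPI", "R-501"): "asset:R-501",
    ("maintenance", "EQ-88421"): "asset:R-501",
    ("historian", "PRES_VESSEL_501"): "asset:R-501",
    ("OPC_UA", "ns=2;s=Unit7.R501"): "asset:R-501",
    ("ERP", "FL-PLANT-AREA-R501"): "asset:R-501",
}


def resolve_identity(refs: List[Tuple[str, str]]) -> Optional[str]:
    """Return the canonical id, or None when identity is unknown.

    None covers three cases: no references, an unmapped reference,
    or references that point at different canonical assets.
    """
    canonical: Set[str] = set()
    for ref in refs:
        if ref not in SAME_AS:
            return None
        canonical.add(SAME_AS[ref])
    return canonical.pop() if len(canonical) == 1 else None
```

A None result here is meant to feed the unknown path, not to be patched over with a best guess.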
The same machinery is not specific to process safety. In a grid codes-and-standards workflow, the identical pattern resolves a span across geospatial, design, live-state, asset-registry, and codes systems, binds it to the governing clause, and returns a cited verdict. One representative finding from that domain:
finding: Vertical clearance below the NESC minimum over a road
rule: nesc_232_road_clearance (scope: regulatory)
citations: NESC Rule 232, Table 232-1
facts_matched: concept=overhead_conductor, crossing=road:truck_traffic,
               voltage_class=230kV, clearance_m=6.40 < required_m=6.77
sources: geospatial(reference), design(P1), live-state(P1), asset-registry(P2)

Different domain, identical shape: resolved identity, matched facts, the exact governing clause, a deterministic comparison, and a traceable chain back to the systems each fact came from.
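The deterministic comparison at the center of that finding is small enough to write out. In the Python sketch below, ClearanceFinding and check_clearance are hypothetical names, and the required value is passed in from the finding rather than looked up in the code table, which a real rule binding would have to do.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClearanceFinding:
    rule_id: str
    citation: str
    result: str                  # "pass", "fail", or "unknown"
    margin_m: Optional[float]    # measured minus required; negative means a shortfall


def check_clearance(measured_m: Optional[float],
                    required_m: Optional[float]) -> ClearanceFinding:
    rule_id = "nesc_232_road_clearance"
    citation = "NESC Rule 232, Table 232-1"
    if measured_m is None or required_m is None:
        # A missing input is reported as unknown, never assumed compliant.
        return ClearanceFinding(rule_id, citation, "unknown", None)
    margin = round(measured_m - required_m, 2)
    result = "pass" if margin >= 0 else "fail"
    return ClearanceFinding(rule_id, citation, result, margin)


finding = check_clearance(measured_m=6.40, required_m=6.77)
print(finding.result, finding.margin_m)
# prints: fail -0.37
```

With the values from the finding, the span is 0.37 m short of the required clearance, so the rule fails and the finding carries its citation with it.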
5. The agent boundary
None of this removes the language model. It places it.
There is real, valuable work for a probabilistic agent to do, and there is work it must not do independently. Drawing that line explicitly is what makes the architecture safe rather than merely sophisticated.
The agent helps assemble the case: it interprets what the operator meant, gathers candidate facts, drafts a readable explanation of the verdict. But the verdict itself, the applicability of a standard, the resolution of identity in a high-risk path, and the conversion of unknown into anything else, all sit on the deterministic side of the line.
The LLM may help assemble the case. It should not be the court.
When standards genuinely conflict, the boundary does not pick a winner. In a real clearance case, an internal design standard, a code table, and a jurisdictional order can all speak to the same crossing, and one of them may even be weaker than the code allows. A black-box model resolves that by sounding confident. The assurance boundary resolves it by returning unknown, surfacing all three sources with their values, flagging that a design standard cannot relax a code minimum, and handing the decision to an engineer. That escalation is not a failure of the system. It is the system working.
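A sketch of that conflict behavior in Python, under stated assumptions: Requirement and resolve are hypothetical names, the 6.77 m code value is taken from the clearance finding earlier in this article, and the design-standard and jurisdictional values are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Requirement:
    source: str             # hypothetical label for where the requirement comes from
    kind: str               # "code", "design", or "jurisdictional"
    min_clearance_m: float


def resolve(requirements: List[Requirement]) -> Dict[str, Any]:
    """Return one agreed value, or unknown with every source surfaced."""
    sources = [(r.source, r.min_clearance_m) for r in requirements]
    if not requirements:
        return {"result": "unknown", "flags": ["no_governing_requirement"], "sources": sources}
    values = {r.min_clearance_m for r in requirements}
    if len(values) == 1:
        return {"result": "resolved", "required_m": values.pop(), "flags": [], "sources": sources}
    flags = ["rule_conflict"]
    code_minimums = [r.min_clearance_m for r in requirements if r.kind == "code"]
    if code_minimums:
        floor = max(code_minimums)
        for r in requirements:
            if r.kind == "design" and r.min_clearance_m < floor:
                flags.append(f"{r.source}_cannot_relax_code_minimum")
    # The function does not pick a winner; it hands the decision to a person.
    return {"result": "unknown", "flags": flags, "sources": sources, "route_to": "engineer"}


case = resolve([
    Requirement("code_table", "code", 6.77),
    Requirement("internal_design_standard", "design", 6.50),        # illustrative value
    Requirement("jurisdictional_order", "jurisdictional", 7.00),    # illustrative value
])
```

Here case["result"] is "unknown", all three sources appear with their values, and the flags record that the design standard sits below the code minimum.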
6. Runtime assurance and context windows
A proof is not permanent. It is true for a specific action, in a specific state, for as long as that state holds.
This is the idea of a context window for a decision:
A decision is valid only for the state window in which its evidence remains true.
For R-501, the assessment should expire and re-run if any of its inputs move: permit status, pressure, temperature, material inventory, asset-identity confidence, the version of an applicable standard, the work-order scope, sensor freshness, model version, or human approval state. A clearance computed against the 2023 edition of a code is not a clearance against the 2027 edition. A pass computed against a five-minute-old pressure reading is not a pass against a stale one.
Assurance is not a certificate attached to the twin forever. It is a continuously refreshed claim about a specific action in a specific context window.
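A context window can be checked mechanically. The Python sketch below reuses the context_window block from the R-501 evidence packet; the function still_valid and its changed_inputs argument are assumptions of this sketch, standing in for whatever change feed a real deployment would have.

```python
from datetime import datetime
from typing import Any, Iterable, Mapping


def parse_utc(ts: str) -> datetime:
    # Packet timestamps end in "Z"; rewrite it so fromisoformat accepts it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


CONTEXT_WINDOW = {
    "evaluated_at": "2026-06-01T14:03:12Z",
    "valid_until": "2026-06-01T14:08:12Z",
    "expires_on_change": ["permit_status", "internal_pressure", "temperature",
                          "material_inventory", "asset_identity",
                          "rule_set_version", "work_order_scope", "approval_state"],
}


def still_valid(window: Mapping[str, Any], now: datetime,
                changed_inputs: Iterable[str]) -> bool:
    """A decision holds only inside its time window and while no watched input has changed."""
    if now > parse_utc(window["valid_until"]):
        return False
    watched = set(window["expires_on_change"])
    return watched.isdisjoint(changed_inputs)
```

For example, the decision still holds at 14:05:00Z with no changes, but not at 14:09:00Z, and not at 14:05:00Z if internal_pressure has moved. In either of the last two cases the packet expires and the assessment has to be re-run.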
This is also where the architecture has to be honest about how the real world breaks. Each failure mode has a defined safe behavior, and in every case the safe direction is to block, or to return unknown and escalate, never a silent pass.
| Failure mode | Example | Safe behavior |
|---|---|---|
| Identity ambiguity | R-501 maps to two equipment records | Return unknown; escalate |
| Stale source data | Pressure reading older than the freshness threshold | Block, or require an updated state |
| Rule conflict | Two standards govern the same action differently | Flag the conflict; require a human decision |
| Missing source | The maintenance system is unavailable | Degrade the capability; do not approve high-risk work |
| Model drift | As-built model diverges from the engineering model | Expire the prior assessment; re-verify |
| Approval mismatch | Agent proposes an action outside its role permission | Block the capability invocation |
| Evidence incompleteness | No material-inventory fact available | Require more evidence, or escalate |
The point of the assurance boundary is not to make every action pass. It is to make unsafe uncertainty visible.
7. Lifecycle verification and validation
Verification and validation are usually treated as gates you pass through before an asset goes live. In a proof-carrying architecture they do not end at commissioning. They become runtime responsibilities, because the operating context keeps changing after the asset is in service.
| Lifecycle phase | Twin / thread input | Assurance check | Output |
|---|---|---|---|
| Plan / design | Engineering model, constraints, standards | Feasibility, conflicts, missing requirements | Design finding |
| Build / integrate | As-built model, equipment data, interfaces | Model-to-reality consistency | Verification record |
| Operate | Live state, historian, permits, work orders | Operational safety and compliance | Approve / block / escalate |
| Maintain | Asset state, maintenance history, procedures | Work-order readiness and risk | Auditable work package |
| Modify / modernize | Changed assets, rules, code, integrations | Regression against the prior baseline | Change assurance record |
Verification and validation do not disappear when the asset enters operation. They become runtime responsibilities.
The loop is continuous: design verification flows into integration verification, into operational validation, into maintenance validation, into change assurance whenever the asset is modified. And at any point, drift or a context change expires the relevant evidence packet and forces a re-check before the next action is admitted.
8. Why a lab beats a generic proof of concept
The honest constraint is that you cannot build a universal industrial assurance layer in the abstract. You earn it one use case at a time, because the proof is only as real as the specific sources, standards, and approval chain it binds.
That is why the right adoption model is a focused lab, not an open-ended pilot. A lab takes one hard, regulated workflow and produces five concrete artifacts:
- A source-system map: which systems hold the facts the decision requires, and which is authoritative for each.
- A governed operating graph: the canonical assets, relationships, rules, and capabilities for that one workflow.
- A rule and standards binding: the clauses, policies, and operating constraints the verifier evaluates.
- A verifier contract: explicit pass/fail/unknown behavior, including the safe-failure rules.
- A working app: an agentic workflow that emits evidence packets, routes approvals, and preserves the audit trail.
The point of the lab is not to produce a demo. The point is to produce the first admissible action path.
Once that path exists for one workflow, it becomes the repeatable pattern: the second use case reuses the same operating graph, the same verifier discipline, the same evidence-packet shape, with new sources and new rules bound in. The assurance layer accretes, workflow by workflow, instead of being promised all at once.
9. Closing
Digital twins gave industry a faithful representation of its assets. That was necessary, and it was not sufficient. The next step is not a more autonomous agent on top of the twin. It is an assurance boundary between recommendation and action, and a requirement that every high-stakes action carry its proof across it.
That is the whole thesis, in one line:
Digital twins are not the destination. Trusted action is.
We are not claiming to have solved industrial autonomy. We are proposing and implementing a specific architecture, one high-stakes workflow at a time: a governed operating graph for context, approved capabilities for action, and a deterministic verifier for admission, with the evidence and the audit trail to prove it.
Notes
- Proof-carrying operations is an analogy to proof-carrying code, scoped to industrial workflows: a proposed action must carry enough structured evidence for an independent verifier to admit, block, or escalate it. It is not a claim of universal mathematical safety.
- The R-501 example is an illustrative plant workflow, not a production case. Several systems each hold partial truth about the same equipment, and trusted action requires resolving those facts into one governed model.
- Digital twin and digital thread are used in the Digital Twin Consortium sense: a synchronized virtual representation, and a dependable lifecycle information flow across systems, respectively.
- The architecture deliberately separates probabilistic assistance from governed decision admission. The agent may help assemble and explain the case; the verifier admits, blocks, or escalates the action.
References
- George C. Necula and Peter Lee, "Safe Kernel Extensions Without Run-Time Checking," 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI '96), 1996. The paper that introduced proof-carrying code. usenix.org
- George C. Necula, "Proof-Carrying Code," 24th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL '97), 1997. dl.acm.org
- Digital Twin Consortium, definitions of digital twin and digital thread (updated October 2024). digitaltwinconsortium.org
- National Institute of Standards and Technology, "Artificial Intelligence Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, January 2023. nist.gov