Vasyl Tretiakov

Posted on Jun 17 • Originally published at vasyltretiakov.dev

Compiling the Process, Not the Code: a machine-checked workflow for coding agents

#ai #llm #agents #programming

The deterministic checks guarding an agent-built project are a compiler for the workflow — and the load-bearing half of them compile the process, not the code.

Halfway through a Tuesday in early June, after I'd spent the morning turning a handful of written-down conventions into scripts that block a commit when they're violated, I typed a half-formed thought at my coding agent instead of another task:

it feels like with script gates we're building a compiler for our coding-agent-enabled workflow

There were around three dozen of these checks in the project by then. None of them compiles anything in the usual sense. They grep, they parse, they diff one artifact against another and exit nonzero when two things that must agree don't. But the feeling was specific and it wouldn't go away, so I want to take it seriously: what is true about it, and exactly where it stops being true. The place it breaks is the useful part.

The asymmetry under the feeling

The agent's first move was to tell me "compiler" was the wrong word for the right intuition. The precise version is a generator/verifier split, and a compiler is one instance of it.

An LLM is a high-throughput, low-consistency generator. It is prolific and genuinely creative, and it is unreliable about rule forty-seven under context pressure. It will use whichever term it saw most recently, in whichever file it's editing, and be fluent and confident while doing it. The cheapest way to make an unreliable generator reliable is not to lecture the generator. It's to put a deterministic, cheap verifier in front of it and exploit an asymmetry: checking that a constraint holds is far cheaper, and far more trustworthy, than generating output that satisfies it on the first try every time. A compiler is that pattern frozen around a language. What I was building was the verifier sized to my generator's specific failure modes.

I should concede immediately that none of that asymmetry is new. It is the oldest idea in our tooling. Type checkers, linters, continuous integration, and the humble test suite all exist because verifying a property is cheaper than producing code guaranteed to have it. The compiler metaphor isn't new either; people have reached for it about prompts and pipelines for a while. So if the essay were just "checks make agents more reliable," it would be a familiar and slightly boring claim. The interesting part is narrower, and it lives in the gap between what a compiler promises and what these checks actually deliver.

Why it feels like a compiler and not just a pile of linters

Most lint suites are a pile. You accumulate rules, you run them, none of them knows about the others. The tell that this had become something more compiler-shaped was that I had started writing checks whose job was to police the other checks: a script that reads every gate and verifies each one is registered, named to convention, and wired into the commit hook the same way. Rules about the rules, mechanically enforced.

That meta-layer is the line between a lint pile and what a compiler-minded person would call static semantics. A pile says "these specific tripwires didn't fire today." A semantics says "the rules themselves are well-formed, so a whole class of malformed rule can't exist." Crossing that line is what makes the workflow feel like it has a grammar rather than a checklist. It is also, as it turns out, exactly the feeling you should be most suspicious of.

Three places the metaphor breaks

A careful reader should distrust a flattering metaphor, so here is where this one falls apart. There are three breaks, and each one is load-bearing.

These are linters, not a type system. A compiler makes a guarantee with a specific shape, the one Robin Milner named soundness: if a program type-checks, an entire class of error is impossible — "well-typed programs cannot go wrong." My checks make a guarantee with a much weaker shape: these particular tripwires did not fire on this particular commit. A spelling denylist catches the words on the list and is blind to the synonym nobody listed. A check that greps a response shape proves the shape is present, not that the behavior behind it is right. The space between the invariant I care about and the grep that approximates it is precisely where the real bugs continue to live, and a green suite hides that space rather than illuminating it. I spent a whole companion essay, Gates Earned From Failure, on the hazard that follows from this: a passing check converts active vigilance into misplaced trust. The compiler-specific version of the point is just that the word "compiler" smuggles in a soundness guarantee these checks have not earned.

It's accreted, not designed — there's no source grammar. A real compiler starts from a language definition. Someone wrote down what a well-formed program is, and the checker enforces that document. My "well-formed workflow" was never written down. It is defined, entirely and only, by the union of whatever checks happen to exist on a given day, and each of those checks was born reactively from one incident that had already bitten me. So the de-facto language the agent is being held to is "whatever passes today," and the gaps in it stay invisible until the next incident reveals one. The always-loaded instruction file is the closest thing to a grammar I have, and it is a prose approximation, not a specification.

The programmer and the language designer are the same agent. This is the one that should worry you most. A real compiler is adversarially independent of the code it judges; the people who designed the language had never seen your program. In my setup the same agent writes the artifact and writes the gate that will judge the artifact's future versions. A gate can therefore faithfully encode the author's blind spot and then certify work as clean against it, forever. Enforcing over discipline is a good rule with a hard ceiling: a gate only ever catches what someone already thought to mechanize. Birgitta Böckeler makes the same observation one layer up, about tests, when she notes that the feedback is weaker than it looks once "the agent also generated the tests." A verifier the generator authored is not independent of the thing it verifies.

The half that compiles the process

Here is the part worth the essay. When I actually looked at my three dozen checks and asked what each one verifies, a large fraction of them were not checking the code at all. They were checking the process.

One enforces that a task needing a written specification can't be picked up in an implementation phase. One flags a task whose declared dependency has gone stale. One enforces the lifecycle of an amendment: proposed, reviewed, accepted, in that order, with no skipping. None of these reads a line of program logic. Each one encodes a piece of software-development-life-cycle (SDLC) discipline that, on a human team, lives in nobody's compiler and everybody's head. We call it professionalism, or being senior, or knowing how things are done here.

Human teams almost never mechanize that layer, and the reason is economic, not cultural. People hold context across days and weeks, judgment tells them when a rule applies, and their name on the commit supplies accountability. Those three things are exactly what let a team write "we follow the amendment process" in an onboarding doc and then assume compliance. Building a script to enforce it would cost more than the drift it prevents on a small team that mostly remembers anyway.

With an agent, that economic calculation inverts, hard. The generator has no continuity between sessions, so the drift rate is high. And the same agent will author the enforcing check for you in a few minutes, so the cost of mechanizing is near zero. When drift is frequent and gates are nearly free, you are rationally pulled to formalize methodology that was never worth formalizing in the entire prior history of software teams. That is the actual discovery hiding under the compiler feeling. Not that I added a spell-checker for my domain vocabulary, but that the economics of agent-directed work quietly made it worth compiling the development process itself into machine-checked constraints.

The instruction file stops being a manual

The same hour I sent that "compiler" message, I followed it with a request that only makes sense if the metaphor is real:

can we retire or compress or move (to HANDBOOK) some instructions… Errors and warnings already explain the error briefly and point to a HANDBOOK section with more details

Once a check owns a convention, the paragraph in the always-loaded file that used to beg the model to remember that convention is dead weight. The gate does the remembering. So you delete the prose and let the error message carry a pointer to a longer explanation that gets read only when the check actually fires. By the end of that session the project had settled into three clean layers: an always-loaded file that states directives, a handbook that explains mechanism, and the gates that enforce. The prose had stopped duplicating the type-checker.

That is the compiler architecture made literal, and it reframes what the instruction file is for. It is not a manual the agent is trusted to follow. It is closer to a grammar: the small, stable statement of intent that the enforcement layer makes binding. Correctness moved out of the prose and into the checks, and the prose got shorter and more honest about its job.

Which invariants earn a type rule

The skill this leaves you with is not "write more gates." A check family only ever grows, and a verifier that accepts everything is as useless as one that rejects everything. The real judgment is which invariants are cheap enough, sound enough, and quiet enough about false positives to deserve promotion into a type rule, and which should stay discipline. That judgment is a cost test, and I worked it out in detail in Gates Earned From Failure rather than repeat it here: build the rule when the evidence the drift is real, times the cost of that drift, outweighs the standing cost of the check.

The compiler framing adds one sharp reminder to that test. Over-strict typing is a real failure mode, not a safe default. The same morning all of this crystallized, I dropped a gate I'd built, because it fired on too many legitimate cases to be worth its noise. In compiler terms that gate was refusing to compile sound programs, and a type system that rejects valid code is exactly as broken as one that admits invalid code. A rejected-but-correct commit teaches the same lesson a missed bug does: the rule was wrong. Calibration runs in both directions.

Honest limitations

This is one person's experience with one codebase of roughly 150,000 lines, where I own the domain model and the workflow both, so the language designer and the programmer being the same agent is my daily reality rather than a thought experiment. That collapse is the project's sharpest limit, and I can't tell you I've escaped it. I can only tell you I try to audit what isn't gated as deliberately as I build what is, because the suite measures the absence of known tripwires, never soundness.

There is also no source grammar, and I'm not sure a healthy version of this ever gets one. The honest description of the system is not "a compiler for my workflow" but "an accreting pile of verifiers that has grown a thin layer of self-governance and started, in places, to behave like one." Whether that hardens into something with a real specification, or stays a well-tended pile, is a question I haven't answered. And the whole economic argument turns on a solo author hitting his own drift the same day he builds the gate. On a team the context cost compounds and the false-positive triage lands on people who never felt the original failure. I have evidence for the solo case and a suspicion the shape survives the move, which is not the same thing.

Why this generalizes

Strip away my particular domain and the pattern is plain. Any agent working on a governed system accumulates drift faster than a human would, and will author the check that catches it for almost nothing. The first checks you reach for are the obvious ones, over the code and the vocabulary. The coupling-check pattern that catches a term renamed in three surfaces and missed in the fourth is one I've written about on its own, in Couple Both Ways. But the ones that surprised me, and the ones I'd point a skeptical reader toward, are the checks that compile the process — the stage gates and lifecycle rules and staleness tripwires that encode how the work is supposed to flow. Human teams carry that in their heads because mechanizing it never paid. Directing an agent is the first context I've worked in where it does.

This essay was written by directing a coding agent over the project it describes; I direct and judge, the agent drafts and argues back. The reframing from "compiler" to "generator/verifier," and the three ways the metaphor breaks, came out of that argument — quoted here roughly as it happened.

I build governed agent systems at the intersection of Contact Center software and AI. If that's a problem you're chewing on, I'm reachable on LinkedIn.

References

Vasyl Tretiakov, "Rails, Not Rules: Enforcing a Coding Agent's Domain Vocabulary with Checks," vasyltretiakov.dev, 1 Jun 2026 — companion essay; the enforce-over-discipline origin.
Vasyl Tretiakov, "Couple Both Ways: bidirectional checks against silent drift," vasyltretiakov.dev, 9 Jun 2026 — companion essay; the coupling-check pattern in depth.
Vasyl Tretiakov, "Gates Earned From Failure: a cost test for agent guardrails," vasyltretiakov.dev, 14 Jun 2026 — companion essay; the cost test and the green-checkmark hazard.
Birgitta Böckeler, "From MCP and Vibe Coding to Harness Engineering: How Did AI Native Engineering Evolve in One Year," InfoQ podcast, 8 Jun 2026 (accessed 14 Jun 2026). Source of "the agent also generated the tests."
Robin Milner, "A Theory of Type Polymorphism in Programming," Journal of Computer and System Sciences 17(3):348–375, 1978 (accessed 17 Jun 2026). Origin of the type-soundness guarantee ("well-typed programs cannot go wrong") that breakage #1 says these checks have not earned.
Eric Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003. ISBN 978-0321125217. (The ubiquitous-language lineage the terminology checks enforce.)

Published at vasyltretiakov.dev.

DEV Community