Decisions, Not Execution: The Layer Observability Misses
A trace shows the one tool that got called. The decision is in the three that didn't.
The last two pieces ended on the same line: tracing what happened is no longer enough; what matters is decisions, not execution. That’s a diagnosis. This is the part I owe you next: what the decision layer actually is, and what it would have to capture to be one.
I’ve written twice now about the gap between what we can see in an AI system and what we can explain. The argument lands in the same place each time: execution is observable, decisions are not, and as systems get more autonomous, the distance between those two things widens. Readers agree. Then they ask the obvious follow-up, which I’d been avoiding: okay — so what’s the thing that closes it?
“Decisions, not execution” is a good slogan and a bad spec. If the decision layer is real, it should be definable — it should have primitives, a shape, and a clear line separating it from the observability layer we already built. Otherwise, it’s just a complaint with better production values.
So let me define it.
A layer is a place where a question gets answered
We already added one layer to the AI stack in the last two years, and it worked. Observability gave us a place to answer what the system did. Before it, that question was genuinely hard; outputs appeared, and the path that produced them was a black box. Now we can trace requests end-to-end, replay agent runs, and decompose a failure into discrete steps. The question *what happened* has a home.
The decision layer is the place where a different question gets answered: why did the system choose what it did, and was that choice sound?
That question currently has no home. It gets answered, when it gets answered at all, by an engineer staring at a clean trace and reconstructing intent from signals that were never designed to carry it. The work happens — it just happens in someone’s head, ad hoc, unrecorded, and unrepeatable. A layer is what you have when that work stops living in people’s heads and starts living in the system as structured, queryable state.
The reason observability can’t simply be extended to cover this is that it’s pointed at the wrong object. It captures the path that was taken. The decision layer has to capture the selection of that path — and selection is exactly the thing that gets collapsed away the moment an output is produced.
The primitives of a decision
Here’s the definitional core. A decision, in an AI system, is not the output. The output is the last step. The decision is everything that produced it, and it has parts. If you want to represent a decision as a first-class object — call it a decision record — these are the fields it has to carry, none of which a trace records today:
Interpretation. How the system read an ambiguous input. The same user message can be understood as a question, a command, or a request to take action. The interpretation chosen determines everything downstream, and it’s invisible — there’s no log line that says “I read this as a command.”
The considered set. The alternatives that were live at the moment of choosing. Which tools were plausible, which responses were in contention, which retrievals were candidates. A trace shows the one tool that got called. The decision is in the three that didn’t.
Weighting. Which signals in the context actually drove the choice, and which were present but ignored. Context is not flat — user intent, conversation history, system instructions, and retrieved documents all sit in the input, but they did not all count equally. The trace shows them as equally present. They weren’t.
Confidence. How close the call was. An output produced by a strong, stable preference and an output produced by a near-tie between two alternatives look identical once realized. Token-level logprobs survive generation — but the margin that matters here is at the level of the *decision* (this ordering versus that one, this tool versus that one), and that margin isn’t sitting in the single realized run. It has to be reconstructed, not read off.
Context state. Not the raw context window, but its *salience over time* — which earlier assumptions are still live, which have decayed, which should have decayed and didn’t. This is where the slow failures live: drift, context erosion, a stale assumption steering reasoning long after its relevance passed.
That’s the spec. A decision layer is whatever captures those five things as first-class signals rather than leaving them to be inferred. The point of listing them is that you can now check any proposed solution against the list. “Better logging” captures none of them. “Replay the run” re-executes the realized path and recovers none of the alternatives. “Log the chain-of-thought” gets you a narrated path, still no considered set, no weighting, no margin. The list is a filter, and most of what gets pitched as a fix doesn’t pass it.
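To make the five fields concrete, here is a minimal sketch of a decision record as a data structure. The names `Estimate`, `DecisionRecord`, and `untagged` are assumptions made for illustration, not an existing library, and the example values are hypothetical. The one design choice it encodes is that every primitive carries the method that produced it, so a value with no provenance can be flagged rather than trusted.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class Estimate:
    """A reconstructed value plus the method that produced it."""
    value: Any
    method: str  # provenance tag, e.g. "ablation"; empty means untagged


@dataclass(frozen=True)
class DecisionRecord:
    """One choice, described by the five primitives listed above."""
    decision: str
    interpretation: Estimate
    considered_set: Estimate
    weighting: Estimate
    confidence: Estimate
    context_state: Estimate

    def untagged(self) -> list[str]:
        """Names of primitives whose estimate names no derivation method."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), Estimate)
            and not getattr(self, f.name).method.strip()
        ]


# Hypothetical values, for illustration only.
record = DecisionRecord(
    decision="which tool to call",
    interpretation=Estimate("read the message as a command", "output structure"),
    considered_set=Estimate(("search", "update", "ask_user"), "re-sampling"),
    weighting=Estimate({"system_rule": 0.1, "last_turn": 0.9}, "ablation"),
    confidence=Estimate(0.2, "ensemble of replayed runs"),
    context_state=Estimate(("turn-2 assumption still live",), ""),
)
print(record.untagged())  # ['context_state']
```

A record that reports an untagged field is making a claim it cannot back up, which is the condition the list is meant to filter out.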
The hardest of the five is the considered set, and it’s worth being honest about why. It’s defined by absence — the alternatives that didn’t happen — which makes it the one most open to the charge that you’re just inventing plausible-sounding options after the fact. The defense is that the considered set isn’t invented; it’s bounded. The tool schema, the retrieval candidates, the response types the system can produce — these define the space of options that were structurally available at that step. You recover which ones were live by re-sampling the same decision point under controlled perturbation of the context and watching what the system actually reaches for. That’s an estimate, and a falsifiable one. It is not a guess about what could have happened in the abstract; it’s a measurement of what the system does reach for when you hold the step fixed and vary what feeds it.
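The re-sampling idea can be sketched in a few lines, under loud assumptions: `choose` stands in for a replayed or mocked decision step (never the live system), the option names are hypothetical, and the 5% cutoff for calling an option "live" is an arbitrary threshold chosen for the example. The function holds the step fixed, varies the context, and reports how often each structurally available option was actually reached for.

```python
import random
from collections import Counter
from typing import Callable, Mapping, Sequence


def estimate_considered_set(
    choose: Callable[[Mapping[str, object], random.Random], str],
    base_context: Mapping[str, object],
    perturbations: Sequence[Mapping[str, object]],
    available: Sequence[str],
    samples_per_context: int = 50,
    min_share: float = 0.05,  # assumed cutoff for "live"
    seed: int = 0,
) -> dict[str, float]:
    """Share of samples per option, keeping only options above min_share."""
    rng = random.Random(seed)
    contexts = [dict(base_context)] + [{**base_context, **p} for p in perturbations]
    counts: Counter = Counter()
    for ctx in contexts:
        for _ in range(samples_per_context):
            choice = choose(ctx, rng)
            if choice not in available:
                raise ValueError(f"{choice!r} is outside the structural option space")
            counts[choice] += 1
    total = sum(counts.values())
    shares = {opt: counts[opt] / total for opt in available}
    return {opt: share for opt, share in shares.items() if share >= min_share}


def mock_choose(ctx: Mapping[str, object], rng: random.Random) -> str:
    """Hypothetical stand-in for a replayed decision step."""
    if ctx.get("history_trimmed"):
        return "tool_b"
    return "tool_a" if rng.random() < 0.8 else "tool_b"


live = estimate_considered_set(
    mock_choose,
    base_context={"history_trimmed": False},
    perturbations=[{"history_trimmed": True}],
    available=["tool_a", "tool_b", "tool_c"],
)
print(sorted(live))  # tool_c is available but never reached for
```

The bound comes from `available`, not from imagination: an option outside the schema raises an error instead of being counted, and an option inside it only counts as live if the step actually picks it.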
What it looks like, concretely
Take the failure I described last time: an agent calling the right tools in the wrong order, only under certain combinations of context and history. Clean trace on every failing run. Here is what the trace gave me:
step 1 user_message "can you update the account and notify them"
step 2 tool_call update_account(id=831)
step 3 tool_result 200 OK
step 4 tool_call send_notification(id=831)
step 5 tool_result 200 OK
step 6 final_response "Done — account updated and the customer notified."
Every line is correct. Every line is also useless for the question I had, which was why this order, in this conversation, when the same agent sequenced it correctly a thousand other times. The trace records that a path was taken. It has nothing to say about why that path was selected over the one I wanted.
Now here’s the same moment as a decision record — and read it for what it is: not a readout, a reconstruction. Nothing in the block below was measured off the failing run. Every value is an estimate produced by a named method, and I’ve tagged each one so the block can’t pretend otherwise. This is the engineer’s-head interpretation written down as data — explicit, falsifiable, and wrong in a legible way when it’s wrong:
decision        ordering: update→notify vs notify→update
interpretation  read "update and notify" as sequential, not atomic
                └ est. from output + tool-call structure
considered_set  [update→notify, notify→update, single transactional call]
                └ est. from tool schema + re-sampling the step under perturbation
weighting       prior-turn rule ("confirm before any write") → low salience
                system rule ("notify only after confirmed write") → never retrieved
                └ est. from ablation: drop each signal, observe ordering shift
confidence      ~0.31 margin between top-two orderings (marginal)
                └ est. from an ensemble over dozens of replayed re-runs,
                  not a within-run readout
context_state   turn-2 confirmation assumption still live at turn 9 (stale)
                └ est. from assumption-tracking pass over the conversation
One note on the methods those tags name (re-sampling, ablation, the ensemble of re-runs): every one of them runs against replayed or mocked execution, never the live system. That’s what makes it safe to re-probe a step that once issued a real write like update_account, and cheap enough to run at the volume reconstruction needs.
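Two of the tags in that record, the ensemble margin and the ablation, are simple enough to sketch. What follows is one plausible reading, not the method behind the numbers above: `replay` is a hypothetical function that re-runs a mocked decision step, the orderings and signal names are placeholders, and total variation distance is my assumed way of measuring "ordering shift".

```python
from collections import Counter
from typing import Callable, Mapping, Sequence

Replay = Callable[[Mapping[str, str], int], str]


def top_two_margin(choices: Sequence[str]) -> float:
    """Share of the most frequent choice minus share of the runner-up."""
    if not choices:
        raise ValueError("need at least one replayed run")
    ranked = Counter(choices).most_common(2)
    top = ranked[0][1] / len(choices)
    runner_up = ranked[1][1] / len(choices) if len(ranked) > 1 else 0.0
    return top - runner_up


def ablation_shift(replay: Replay, context: Mapping[str, str], runs: int = 30) -> dict[str, float]:
    """Drop each signal in turn; report how far the choice distribution moves."""
    def sample(ctx: Mapping[str, str]) -> dict[str, float]:
        counts = Counter(replay(ctx, i) for i in range(runs))
        return {choice: n / runs for choice, n in counts.items()}

    baseline = sample(context)
    shifts = {}
    for signal in context:
        ablated = {k: v for k, v in context.items() if k != signal}
        dist = sample(ablated)
        options = set(baseline) | set(dist)
        # Total variation distance between baseline and ablated distributions.
        shifts[signal] = 0.5 * sum(
            abs(baseline.get(o, 0.0) - dist.get(o, 0.0)) for o in options
        )
    return shifts


def mock_replay(context: Mapping[str, str], run_index: int) -> str:
    """Invented behavior: without the system rule, one run in three flips."""
    if "system_rule" in context:
        return "order_a"
    return "order_a" if run_index % 3 else "order_b"


ctx = {"system_rule": "hypothetical rule text", "prior_turn_rule": "hypothetical rule text"}
shifts = ablation_shift(mock_replay, ctx)
print({k: round(v, 2) for k, v in shifts.items()})  # {'system_rule': 0.33, 'prior_turn_rule': 0.0}

no_rule_runs = [mock_replay({"prior_turn_rule": "x"}, i) for i in range(30)]
print(round(top_two_margin(no_rule_runs), 2))  # 0.33
```

A signal whose removal moves nothing was present but not counted, and a small margin across the ensemble marks a call that was close. Both numbers are estimates over replayed runs, which is why the record tags them that way.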
That block is not a log, and it’s not telemetry. It’s a structured claim about how the choice was made — and because it’s structured, it points somewhere: marginal ordering, a decisive rule that was never retrieved, a stale assumption from seven turns back. In the real case the fix was the retrieval strategy and prompt structure, exactly there. The trace took me a week. The decision record is that week of interpretation, written down once, in a form the next person — or the next automated check — doesn’t have to redo from scratch.
I want to be exact about the claim, because the whole piece turns on it. A decision layer does not make decisions directly observable; nothing does. What it does is take the reconstruction that currently happens informally, in an engineer’s head, and force it into a structured, persistent, derivation-tagged form — something you can store, query, compare across runs, and disagree with on the merits. The values are estimates and the block says so. That’s the move. Not certainty. Structure, and honesty about its provenance, where there was neither.
Why this is a layer and not a feature
It would be easy to read all of this as “add a few more fields to your traces.” It isn’t, and the difference matters.
A feature answers an existing question better. A layer answers a question the layer below it structurally cannot. Observability operates on execution — its atomic unit is the step. You can enrich steps indefinitely and never get a considered set, because the alternatives were never on the path; they’re defined by their absence from it. The decision layer’s atomic unit is the choice, and a choice is a relationship between the path taken and the paths that weren’t. That object doesn’t exist at the execution level. You can’t tack it on. You build above it.
This is also why the decision layer sits where it does in the stack — above execution, below judgment. Execution tells you the system called send_notification. Judgment tells you whether notifying the customer was the right business outcome. The decision layer is the missing middle: it tells you the system chose to notify on a marginal ordering, on a stale assumption, having never retrieved the rule that should have governed it. That’s not the same as knowing the outcome was wrong. It’s knowing whether the decision was sound regardless of how the outcome happened to land — which is the only thing that lets you tell a good system that got unlucky from a bad system that got lucky.
What changes once you have it
Three things move, and they’re the three that the execution view keeps fumbling.
Evaluation stops being output-graded. Most evals score whether the final answer was right. But a marginal decision that happened to produce a correct output is a latent failure wearing a passing grade. Decision records let you grade the quality of the choice independent of whether it got lucky — how close the margin was, whether the decisive context was actually retrieved, whether the considered set contained the right option at all. That’s the difference between measuring outcomes and measuring decisions.
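As a sketch of what grading the choice separately from the outcome could look like: the field names, the four labels, and the 0.5 margin threshold below are all assumptions made for the example. The point it illustrates is that outcome and decision quality are two independent axes, so a correct output can still be graded as a latent failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedRun:
    output_correct: bool               # what an output-graded eval already scores
    margin: float                      # estimated gap between the top two options
    decisive_context_retrieved: bool   # was the governing signal actually used
    right_option_considered: bool      # was the correct option in the considered set


def decision_sound(run: GradedRun, min_margin: float = 0.5) -> bool:
    """Assumed soundness rule: clear margin, right context, right option live."""
    return (
        run.margin >= min_margin
        and run.decisive_context_retrieved
        and run.right_option_considered
    )


def classify(run: GradedRun, min_margin: float = 0.5) -> str:
    """Cross the outcome with the soundness of the decision behind it."""
    sound = decision_sound(run, min_margin)
    if run.output_correct:
        return "sound" if sound else "latent failure"
    return "unlucky" if sound else "failure"


# Hypothetical run: right answer, marginal call, governing rule never retrieved.
lucky = GradedRun(output_correct=True, margin=0.31,
                  decisive_context_retrieved=False, right_option_considered=True)
print(classify(lucky))  # latent failure
```

An output-graded eval would count that run as a pass. Graded on the decision, it joins the list of things to fix before the luck runs out.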
Drift becomes visible before the outputs go bad. The slow failures (behavioral drift, context erosion) are invisible at the output level precisely because no single output looks wrong. At the decision level they show up as a distribution shift: margins narrowing across runs, the same stale assumption recurring, decisive signals quietly dropping out of the weighting. You see the decisions degrade before the answers do.
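One simple way to watch for margins narrowing, sketched under assumptions: the margins come from stored decision records for the same decision point, ordered oldest to newest, and the window size and tolerance are arbitrary values picked for the example. It compares the earliest window against the most recent one, which is crude but enough to show the idea.

```python
from statistics import mean
from typing import Sequence


def margin_drift(margins: Sequence[float], window: int = 20) -> float:
    """Mean margin of the latest window minus the earliest; negative means narrowing."""
    if len(margins) < 2 * window:
        raise ValueError("need at least two full windows of margins")
    return mean(margins[-window:]) - mean(margins[:window])


def is_narrowing(margins: Sequence[float], window: int = 20, tolerance: float = 0.1) -> bool:
    """Flag the decision point once margins have dropped by more than the tolerance."""
    return margin_drift(margins, window) <= -tolerance


# Hypothetical history: every output was correct, but the calls got closer.
history = [0.8] * 20 + [0.4] * 20
print(is_narrowing(history))  # True
```

Nothing in that history would trip an output-level alert, since no single answer was wrong. The signal only exists once margins are stored per decision and compared across runs.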
And that same view is where the cost is hiding. The agent that loops — re-retrieving, re-deciding, burning tokens to relitigate a choice it should have made cleanly the first time — is making a sequence of marginal decisions, and a marginal decision is the unit of wasted spend. You cannot bill that to a line item from a trace; the trace just shows more steps. From a decision record, narrow margins and unstable considered sets are the signal that the system is paying to think in circles. It’s the one beat in here a team feels directly in the bill.
Accountability gets an object to point at. As these systems take consequential actions, “the model decided” stops being an acceptable end of the sentence — for operators internally, and increasingly for anyone governing the system from outside. What an audit needs is an artifact: not the output, not the raw trace, but a record of how the choice was formed and how sound it was. Replay shows you it happened again. A decision record is what lets you ask whether it should have.
In closing
I’ve spent two pieces arguing that decisions, not execution, are what shape behavior. The honest gap in that argument was that I never said what a decision is, concretely enough to build toward. So: it’s interpretation, a considered set, weighting, confidence, and context state — five things a trace structurally cannot hold, each recoverable only as a tagged estimate, and all of them what the decision layer exists to make first-class.
I’m not attached to the name. “Decision layer,” “behavioral intelligence,” “decision quality” — the market will decide which label sticks, and it’ll pick based on which one names the problem people actually feel. The problem is the durable part, and it hasn’t changed since the first piece: teams can trace every step and still can’t explain the decision — why the system chose what it did, whether the choice was sound, or where it’s quietly wasting money making marginal ones.
What’s changed is that the gap now has a shape. Not better logging. Not deeper traces. A layer above execution whose job is decisions — reconstructed, structured, derivation-tagged, and queryable, so the interpretation stops happening in someone’s head and starts happening in the system, where you can check its work.
The execution layer told us what happened. The next one has to tell us why.

