We Built AI Observability. It Still Doesn’t Tell Us Why Systems Fail
A deeper look at the gap between observing AI systems and actually understanding their decisions
In the last piece, I wrote about how AI systems fail in ways that don’t show up cleanly in logs or metrics.
The natural response to that problem has been a wave of AI observability tools. Today, we can trace requests end-to-end, inspect prompts, and replay entire agent runs with a level of visibility that didn’t exist even a year ago.
And yet, when something goes wrong, teams still end up in the same place: trying to explain why the system did what it did.
Observability improved visibility. It didn’t solve understanding.
The Current State of Observability
Contemporary AI systems are significantly more observable than earlier generations of deployed models.
Modern observability infrastructure enables reconstruction of execution paths at a granular level, including prompt composition, intermediate tool interactions, and final outputs. In many cases, entire agent runs can be replayed, allowing inspection of how a system progresses from input to output across a sequence of decisions.
This shift has materially changed how production systems are analyzed. Failures that previously appeared opaque can now be decomposed into discrete execution steps, each of which can be inspected and compared across runs.
As a result, AI systems have become increasingly inspectable at the level of execution.
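As a concrete picture of what this kind of instrumentation records, here is a minimal sketch of a replayable trace. The `TraceStep` and `AgentTrace` names are illustrative, not any particular vendor's schema:

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    """One recorded step in an agent run: what observability captures."""
    step: int
    kind: str          # e.g. "prompt", "tool_call", "output"
    payload: str       # prompt text, tool arguments, or generated text

@dataclass
class AgentTrace:
    run_id: str
    steps: list[TraceStep] = field(default_factory=list)

    def replay(self):
        """Walk the realized execution path, step by step, in order."""
        for s in sorted(self.steps, key=lambda s: s.step):
            yield (s.step, s.kind, s.payload)

# A minimal run: prompt -> tool call -> final output.
trace = AgentTrace(run_id="run-001", steps=[
    TraceStep(1, "prompt", "Summarize the incident report."),
    TraceStep(2, "tool_call", "fetch_report(id=42)"),
    TraceStep(3, "output", "The outage was caused by a config change."),
])

# Only the realized path is recorded; nothing about alternatives survives.
path = [kind for _, kind, _ in trace.replay()]
print(path)
```

Everything in a record like this describes the path that was taken. Nothing in it captures what else the system weighed at each step, which is the gap the rest of this piece is about.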
Inspectability at the level of execution does not imply understanding at the level of behavior.
Limits of Execution-Level Observability
Execution Does Not Encode Reasoning
Execution traces capture the sequence of operations performed by a system, but do not encode the reasoning process that led to the selection of those operations.
In deterministic systems, execution paths are a direct reflection of the underlying logic. Given a fixed input, the sequence of operations fully specifies how an outcome was produced. Inspecting execution is therefore often sufficient to explain system behavior.
Autonomous AI systems diverge from this model. Behavior is produced through probabilistic reasoning, where multiple latent decision paths may exist for a given input. The observed execution reflects only the realized path, not the alternative reasoning paths that were implicitly available or the relative weighting that led to their selection.
This makes it possible for identical or near-identical execution traces to arise from materially different reasoning processes. Conversely, similar reasoning processes may produce divergent execution paths under small variations in context.
This decoupling limits the ability to attribute outcomes to specific decision logic based on execution data alone.
Outputs Collapse Underlying Uncertainty
Final outputs represent a collapsed form of the system’s internal reasoning process, obscuring the range of alternatives, assumptions, and intermediate signals that contributed to the result.
During generation, probabilistic systems evaluate multiple potential continuations, each associated with varying likelihoods and contextual relevance. The selected output reflects only the realized sequence, not the distribution of alternatives that were considered or the degree of uncertainty present at each step.
This collapse removes visibility into how strongly different factors influenced the outcome. Signals that were weakly preferred, strongly weighted, or in conflict are no longer distinguishable once the output is produced.
As a result, two outputs that look similar on the surface may reflect very different levels of certainty or reasoning stability, while outputs that differ may originate from closely related decision processes separated only by small variations in input or context.
This loss of intermediate structure limits the ability to assess confidence, identify ambiguity, or determine whether a given output reflects a robust decision or a marginal selection among competing alternatives.
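The collapse can be made concrete with a toy distribution over candidate continuations. In the sketch below, the candidate actions and logits are invented for illustration: two decisions select the same action, but one is a confident choice and the other a near-tie, and once only the selected action is logged, that difference is gone:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate continuations at a single decision point.
candidates = ["approve", "escalate", "reject"]

confident = softmax([4.0, 0.5, 0.2])   # one clear winner
marginal  = softmax([1.3, 1.2, 1.1])   # a near-tie

for name, dist in [("confident", confident), ("marginal", marginal)]:
    choice = candidates[dist.index(max(dist))]
    # Gap between the chosen action and the runner-up.
    margin = max(dist) - sorted(dist)[-2]
    print(name, choice, round(margin, 3))
```

Both runs emit `approve`, so their traces and outputs are indistinguishable, yet one decision had a margin above 0.9 and the other below 0.04. That margin is exactly the intermediate structure the final output discards.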
Context Is Not Represented as a First-Class Signal
While observability systems capture inputs and intermediate states, they do not explicitly represent the contextual structure within which decisions are made.
In practice, context is composed of multiple overlapping signals, including user intent, interaction history, system instructions, and domain-specific constraints. These signals are present within the input space, but are not encoded in a way that distinguishes their relative importance or influence on the outcome.
This makes it difficult to determine which elements of context were materially relevant to a given decision, and which were incidental or ignored. Inputs that appear equally salient at the surface level may have been weighted differently during reasoning, while strongly influential signals may not be identifiable post hoc.
This lack of explicit representation also obscures how context evolves over time. In multi-step interactions or agent workflows, earlier assumptions may persist, decay, or be superseded by new information, but these dynamics are not directly observable through execution traces or final outputs.
Without a structured representation of contextual relevance, interpretation of system behavior remains dependent on inference, rather than direct observation.
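To make the gap tangible, here is a sketch contrasting a first-class context representation with the flat prompt that traces actually capture. The `ContextSignal` structure and its fields are hypothetical; in particular, the per-signal weights are `None` because no current instrumentation recovers them:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContextSignal:
    """A single contextual input, tagged by its source."""
    source: str                      # "user_intent", "history", "system_instruction", ...
    content: str
    weight: Optional[float] = None   # influence on the decision: unknown post hoc

signals = [
    ContextSignal("system_instruction", "Never disclose account balances."),
    ContextSignal("history", "User asked about fees twice already."),
    ContextSignal("user_intent", "Explain this charge on my statement."),
]

# What today's traces give us: the flattened input, with the relative
# influence of each signal erased.
flat_prompt = "\n".join(s.content for s in signals)

# What direct observation would require: a weight per signal. Every one
# of them is unattributed here, because the weights are not observable.
unattributed = [s.source for s in signals if s.weight is None]
print(unattributed)
```

The point of the sketch is the `weight` field: the structure that would let us say which signal mattered exists in principle, but nothing in the execution trace populates it.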
From Execution to Decisions
The limitations of execution-level observability reflect a deeper mismatch between how AI systems are analyzed and how they operate.
Execution traces, outputs, and captured inputs describe what a system did, but do not provide a structured account of how decisions were formed. As a result, explanation remains indirect, requiring reconstruction of reasoning from signals that do not explicitly encode it.
This requires a shift in perspective. Rather than treating system behavior as a sequence of executed steps, it becomes necessary to interpret behavior in terms of decisions.
A decision, in this context, is not just an output. It is the result of interpreting inputs, weighing competing signals, and selecting an action under uncertainty.
Decisions are not directly observable. They must be inferred from the relationship between context, intermediate state, and final output. This inference introduces ambiguity, as multiple plausible decision processes may be consistent with the same observed behavior.
As AI systems become more autonomous, this distinction becomes increasingly important. Failures are less often attributable to incorrect execution and more often to misaligned or unstable decisions within otherwise valid execution paths.
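The ambiguity of this inference can be illustrated with a toy example: two hypothesized decision processes, both fully consistent with the same observed trace and output. The `Decision` structure and the hypothesis names are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """A hypothesized decision process: inferred, never directly observed."""
    process: str
    evidence: dict   # the observed behavior this hypothesis must explain

# Everything observability actually gives us for one run.
observed = {
    "context": "user requested a refund",
    "trace": ["prompt", "lookup_policy", "output"],
    "output": "Refund approved.",
}

# Two different decision processes that would each produce this behavior:
# genuinely applying the refund policy, or pattern-matching on phrasing.
hypotheses = [
    Decision("apply_refund_policy", observed),
    Decision("pattern_match_on_phrasing", observed),
]

# Execution data alone cannot rule either one out.
consistent = [h.process for h in hypotheses if h.evidence["output"] == "Refund approved."]
print(len(consistent))  # 2
```

Both hypotheses survive every check the trace supports, which is the precise sense in which decisions must be inferred rather than read off the execution record.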
In Closing
The inability to directly observe reasoning, preserve uncertainty, or represent context as a structured signal introduces a persistent gap between what can be seen and what can be explained.
In practice, this gap places increasing pressure on interpretation. Engineers, product teams, and operators are required to infer decision processes from signals that do not explicitly encode them, often under conditions where outcomes carry real-world consequences.
As systems become more autonomous, this limitation becomes more difficult to ignore. Explanation is no longer a secondary concern. It is a prerequisite for building systems that can be evaluated, trusted, and controlled.

