
What your agents do when you're not looking
AI agents fail in ways traditional monitoring can't see. Here's what observability really means for an agent and where to start.
You kick off a big refactor. Dozens of files, a tangle of imports, tests to keep green, the kind of change you'd normally block out a few days for. You hand the task to Claude Code or Codex, it uses subagents where it helps, and you go get lunch.
Forty minutes later you're back. The branch is green. The tests pass. There's a clean PR waiting with a tidy description. It worked.
Then a small, uncomfortable question surfaces: what did it actually do for forty minutes?
How many model calls did that take? How many subagents did it spin up, and did any of them get stuck retrying the same failing command? Did it read a config file with a live API key in it? And the bill: was that two dollars of work, or two hundred?
You don't know. Nobody knows. The work is done and the record of how it got done has already evaporated.
The agent can fail perfectly. No exceptions, no 500 error codes, just a green checkmark on your task, with a wrong answer or a runaway cost sitting right underneath.
Most of us have decided to live with this. The interesting question is what changes when you don't have to.
Why agents need their own dashboard
Traditional monitoring was built to answer one question: did the request succeed? For a web service, an HTTP 200 and sub-200ms latency are mostly the whole story.
An agent breaks that model completely. A single request you type fans out into dozens of model calls, tool invocations, subagent handoffs, and retry loops: an entire tree of work, not a single hop. Your performance monitoring dashboard sees one process that ran for forty minutes and exited zero. Everything that actually matters happened inside that span, and the dashboard can't see in.
This is why agent observability has become its own discipline, with its own data model and its own failure modes. And the failure modes are the tell: most agent incidents aren't the model being wrong. They're tool calls that failed silently, context that got truncated at the worst moment, and loops that kept burning tokens long after they stopped making progress. Standard tooling is blind to all three.
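To make the third failure mode concrete, here is a minimal sketch of spotting a retry loop in a list of tool calls. The ToolCall record and the find_retry_loops helper are hypothetical names made up for illustration, and "the same failing call three times in a row" is an assumed threshold, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical record of one tool invocation in an agent run."""
    tool: str
    arguments: str
    ok: bool


def find_retry_loops(calls, threshold=3):
    """Return (tool, arguments, run_length) for every run of consecutive,
    identical, failing calls that is at least `threshold` long."""
    loops = []
    run_key = None  # (tool, arguments) of the failing run in progress
    run_len = 0
    for call in calls:
        key = None if call.ok else (call.tool, call.arguments)
        if key is not None and key == run_key:
            run_len += 1
            continue
        # The run (if any) just ended; record it if it was long enough.
        if run_key is not None and run_len >= threshold:
            loops.append((*run_key, run_len))
        run_key = key
        run_len = 1 if key is not None else 0
    if run_key is not None and run_len >= threshold:
        loops.append((*run_key, run_len))
    return loops


calls = (
    [ToolCall("read", "config.yaml", True)]
    + [ToolCall("bash", "pytest -q", False)] * 4
    + [ToolCall("edit", "app.py", True)]
    + [ToolCall("bash", "pytest -q", False)] * 2
)
print(find_retry_loops(calls))  # [('bash', 'pytest -q', 4)]
```

None of those four failures would raise an exception in the process your monitoring watches. They only show up if something is reading the calls themselves.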
Cost is the same story in miniature. LLM APIs are stateless, so every turn re-sends the entire conversation so far as input, and you're billed for it again. A session carrying 100K tokens of history pays to reprocess all 100K on each new turn (prompt caching softens the repeated part, but only if you're getting cache hits). Add subagent fan-out and a few retry loops and the curve stops being linear. A focused bug fix might cost 20K tokens; a multi-file refactor with subagents can quietly reach four figures in an afternoon. The hard part was never the size of the bill. It's that nothing told you what caused it until the invoice arrived.
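The arithmetic behind that curve is easy to sketch. The example below assumes every turn adds the same number of new tokens and re-sends all prior history; the function names and the cached_price_factor of 0.1 are illustrative assumptions, not any provider's actual pricing.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens processed when every turn re-sends the full history.

    Turn k sends k * tokens_per_turn tokens, so the total is the series
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


def billed_token_equivalents(turns, tokens_per_turn, cache_hit_ratio,
                             cached_price_factor=0.1):
    """Input cost in 'full-price token' units, with part of the re-sent
    history served from a prompt cache at a discount.

    cache_hit_ratio is the assumed fraction of re-sent history that hits
    the cache; cached_price_factor is the assumed price of a cached token
    relative to an uncached one (illustrative value only).
    """
    total = 0.0
    for k in range(1, turns + 1):
        history = (k - 1) * tokens_per_turn  # tokens re-sent from earlier turns
        cached = history * cache_hit_ratio
        uncached = history - cached
        total += tokens_per_turn + uncached + cached * cached_price_factor
    return total


print(cumulative_input_tokens(50, 2_000))        # 2550000
print(billed_token_equivalents(50, 2_000, 0.0))  # same total, no cache hits
print(billed_token_equivalents(50, 2_000, 1.0))  # roughly 345000 with full hits
```

With 50 turns of 2,000 tokens each, the session only ever adds 100K tokens of new content, yet it processes 2.55 million input tokens along the way. Under these assumed numbers, full cache hits cut that to about 345K full-price equivalents, which is why the hit rate matters so much.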
What you actually need to see
If the failures live inside the agent's run, that's exactly where you need to be looking. Four things to capture:
The shape of what happened. Every step the agent took, laid out like a tree: the task at the top, and under it each model call and each tool it reached for. Walk down it and you can see the choice it made at step seven, plus where the conversation got too long and older context was trimmed away, which is often where an agent quietly loses the plot.
Cost, speed, and errors. For each step: what it cost in tokens, how long it took, and whether it worked. The time usually goes where you'd least guess, and a model that answered badly is a different problem from a tool that failed, so keep the errors separate by kind.
The quality of the result. Knowing what the agent did is not the same as knowing it did the right thing. Check it three ways: small tests on individual steps, a "model grading the model" pass for the judgment calls, and a steady sample of real sessions to catch quality slipping over time.
What it exposed. An agent sees everything you hand it, including things that should never have been there: an API key in a config file, a customer's email in a log. Watching one means catching that sensitive data as it passes through, so a leak shows up as a finding instead of a surprise.
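The first two items can be sketched as a small data structure. This is a minimal illustration, not a real tracing schema: the Step class, its field names, and the "model" / "tool" error kinds are all assumptions made up for the example, and duration_s is taken to mean time spent in the step itself, excluding its children.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """Hypothetical node in an agent's run tree."""
    name: str
    kind: str                          # "task", "model_call", "tool_call", "subagent"
    tokens: int = 0
    duration_s: float = 0.0            # self time, excluding children
    error_kind: Optional[str] = None   # e.g. "model" or "tool"; None means success
    children: list["Step"] = field(default_factory=list)


def walk(step):
    """Yield the step and all of its descendants, depth first."""
    yield step
    for child in step.children:
        yield from walk(child)


def summarize(root):
    """Roll the tree up into total tokens, the slowest step, and errors by kind."""
    steps = list(walk(root))
    return {
        "tokens": sum(s.tokens for s in steps),
        "slowest": max(steps, key=lambda s: s.duration_s).name,
        "errors_by_kind": dict(Counter(s.error_kind for s in steps if s.error_kind)),
    }


run = Step("refactor imports", "task", children=[
    Step("plan", "model_call", tokens=4_000, duration_s=6.0),
    Step("run tests", "tool_call", duration_s=95.0, error_kind="tool"),
    Step("fix module A", "subagent", children=[
        Step("edit", "model_call", tokens=9_000, duration_s=12.0, error_kind="model"),
        Step("run tests", "tool_call", duration_s=40.0),
    ]),
])
print(summarize(run))
# {'tokens': 13000, 'slowest': 'run tests', 'errors_by_kind': {'tool': 1, 'model': 1}}
```

Keeping error_kind on the step, rather than a single pass/fail flag on the whole run, is what lets a failed tool and a bad model answer be counted separately.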
The encouraging part is that this is standardizing. OpenTelemetry's GenAI semantic conventions now define a shared vocabulary for all of it (gen_ai.usage.input_tokens, gen_ai.request.model, gen_ai.response.finish_reasons), so a span from one framework looks like a span from any other. Both Claude Code and Codex already emit OTel metrics and log events, with trace support arriving. The plumbing is being laid. (If you've read our mission, you know observability is one of the six things we instrument from day one inside the businesses we work with, and this is that pillar, up close.)
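Here is a rough sketch of what that shared vocabulary buys you. The two input record shapes ("framework A" and "framework B") and their field names are invented for illustration; only the gen_ai.* attribute names come from the OpenTelemetry GenAI conventions. Plain dictionaries stand in for real spans so the example has no dependencies.

```python
def from_framework_a(record: dict) -> dict:
    """Map a hypothetical 'framework A' record onto GenAI attribute names."""
    return {
        "gen_ai.request.model": record["model"],
        "gen_ai.usage.input_tokens": record["prompt_tokens"],
        "gen_ai.usage.output_tokens": record["completion_tokens"],
        "gen_ai.response.finish_reasons": [record["stop"]],
    }


def from_framework_b(record: dict) -> dict:
    """Map a hypothetical 'framework B' record onto the same names."""
    usage = record["usage"]
    return {
        "gen_ai.request.model": record["engine"],
        "gen_ai.usage.input_tokens": usage["in"],
        "gen_ai.usage.output_tokens": usage["out"],
        "gen_ai.response.finish_reasons": list(record["finish_reasons"]),
    }


def total_input_tokens(spans) -> int:
    """One query works across frameworks once the attribute names agree."""
    return sum(span["gen_ai.usage.input_tokens"] for span in spans)


a = from_framework_a({"model": "example-model", "prompt_tokens": 1200,
                      "completion_tokens": 300, "stop": "stop"})
b = from_framework_b({"engine": "example-model",
                      "usage": {"in": 800, "out": 150},
                      "finish_reasons": ["stop"]})
print(total_input_tokens([a, b]))  # 2000
```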
Start with your own agents
Everything above is the full build: the traces, the quality checks, the shared dashboard a whole team watches to understand how agents are being used. That's the enterprise end state, and it's most of what we stand up for clients.
But there's a layer underneath it that almost nobody instruments, the one closest to you. The individual developer, at their own machine, running twenty agent sessions a day. The team dashboard, if it exists at all, aggregates those into a number for a cost center. It throws away the thing you needed: a clear read on what your own agents just did.
And here's the quiet absurdity: the data already exists. It's sitting on your disk right now. Claude Code writes every session as JSONL under ~/.claude/projects/. Codex keeps its own under ~/.codex/sessions/. The full transcript, every tool call, every token, already captured. It's just locked in silos that don't talk to each other and that you have no easy way back into once you close the terminal. The built-in /cost command tells you about the session you're in, then forgets.
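As a rough sketch of how little it takes to start reading those files, the example below sums token usage from the lines of one JSONL session. The record layout is an assumption: it guesses that some lines carry a message object with a usage object holding input_tokens and output_tokens, and it skips anything that doesn't match. Check a real file before relying on those field names, since the format is not a documented contract and may change.

```python
import json
from pathlib import Path


def sum_usage(lines):
    """Sum token counts across the JSONL lines of one session.

    Assumes (unverified) records shaped like
    {"message": {"usage": {"input_tokens": 1, "output_tokens": 2}}}.
    Lines that are blank, malformed, or shaped differently are skipped.
    """
    totals = {"input_tokens": 0, "output_tokens": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        message = record.get("message") if isinstance(record, dict) else None
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue
        for key in totals:
            value = usage.get(key, 0)
            if isinstance(value, int):
                totals[key] += value
    return totals


def session_files(root=None):
    """List session files under ~/.claude/projects/ (or another root)."""
    root = Path(root) if root is not None else Path.home() / ".claude" / "projects"
    return sorted(root.rglob("*.jsonl")) if root.is_dir() else []


if __name__ == "__main__":
    for path in session_files():
        with path.open(encoding="utf-8", errors="replace") as handle:
            print(path.name, sum_usage(handle))
```

The same approach should carry over to ~/.codex/sessions/, though the record layout there will differ and needs its own mapping.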
So the record of what your agents do all day is right there: fragmented, unsearchable, and effectively invisible.
Reading what's already on your machine
This is the gap we built Actvt for.
Actvt is a native macOS app that reads the agent sessions already on your Mac and turns them into something you can actually use. It unifies Claude Code and Codex into one history that is searchable, filterable, and permanent. Open any session and you get the full transcript rendered cleanly: thinking blocks, tool calls, subagents, the whole tree. The shape of what happened from the section above, except human-readable and sitting one keystroke away.
Every Claude Code and Codex session in one window, searchable, with the cost attached.
The failure modes from earlier each get a front door. Smart Collections surface your Expensive, Has Errors, and Has PII sessions automatically: the runaway loop, the silent tool failure, and the leaked secret, found for you instead of by an invoice. The Cost Overview does the aggregation /cost can't: spend by project, a daily heatmap, and cache hit ratio, which is the single biggest lever you have for cutting agent cost.
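For a sense of what that ratio measures, here is a minimal sketch. It assumes the ratio is defined as cached input tokens over all input tokens, which is one reasonable definition and not necessarily the one Actvt uses; the function names and the sample project data are made up for illustration.

```python
from collections import defaultdict


def cache_hit_ratio(cached_tokens: int, uncached_tokens: int) -> float:
    """Fraction of input tokens served from the prompt cache (assumed definition)."""
    total = cached_tokens + uncached_tokens
    return cached_tokens / total if total else 0.0


def ratio_by_project(sessions):
    """Aggregate (project, cached_tokens, uncached_tokens) rows per project."""
    cached = defaultdict(int)
    uncached = defaultdict(int)
    for project, cached_tokens, uncached_tokens in sessions:
        cached[project] += cached_tokens
        uncached[project] += uncached_tokens
    return {p: cache_hit_ratio(cached[p], uncached[p]) for p in cached}


sessions = [
    ("api-server", 60_000, 20_000),
    ("api-server", 15_000, 5_000),
    ("docs-site", 0, 40_000),
]
print(ratio_by_project(sessions))  # {'api-server': 0.75, 'docs-site': 0.0}
```

A project sitting near zero is re-paying full price for its history on every turn, which makes it the first place to look.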
And because the whole point of watching your agents is that they see everything you paste, the sensitive-data scan runs entirely on-device. Emails, keys, cards, IDs: flagged locally, nothing uploaded, ever. Observability without the surveillance.
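To show why this kind of scan doesn't need a server, here is a deliberately small sketch of local pattern matching. It is not Actvt's implementation. The two patterns and the scan function are illustrative assumptions that cover only emails and card-like numbers, with a Luhn checksum to cut down on false positives from ordinary long numbers.

```python
import re

# Illustrative patterns only; real scanners cover many more formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan(text: str):
    """Return (kind, value) findings; everything runs in-process."""
    findings = [("email", m.group()) for m in EMAIL_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_valid(digits):
            findings.append(("card", digits))
    return findings


sample = "mail jane@example.com, card 4111 1111 1111 1111, ref 1234567890123456"
print(scan(sample))
# [('email', 'jane@example.com'), ('card', '4111111111111111')]
```

The reference number in the sample has sixteen digits but fails the checksum, so it is left alone.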
The part that surprised us most isn't a feature. It's what visibility does to behavior. Once you can actually see how an agent spends its tokens and what it reaches for, you start writing tighter prompts and structuring projects so the agent wastes less. You can't optimize a process you've never watched run.
Closing
There's a version of that forty-minutes-and-a-green-checkmark scene in every developer's week now. The work lands; the story of how it landed disappears. For a whole organization, closing that gap looks like traces, eval harnesses, and a governed view across every agent, the plumbing we lay inside the businesses we work with. For one developer at their desk, it starts with being able to open up what your agents did today and read it.
We think both layers matter, and we're building at both. If you've got your own version of the surprise bill, or you just want to finally see what your agents do when you're not looking, we'd love to hear how you're thinking about it.
Oye Collective builds production AI agents inside real businesses, and Actvt gives you observability into your own. Reach us at oyecollective.com.