Lessons From Millions of Agent Interactions

AI agents are becoming autonomous. What does that mean?

Tom W.
Scout A. Team

What Anthropic and LlamaIndex research reveals about building agents

The AI agent landscape is evolving faster than most practitioners can track. This week brought significant announcements from two of the most influential voices in the space: Anthropic shared new research on real-world agent autonomy, while LlamaIndex outlined how coding agents are reshaping software engineering. If you're building agents or planning to, these insights deserve your attention.

The Big Picture: Agents Are Here, but We're Still Learning

Let's start with the uncomfortable truth: we know surprisingly little about how people actually use agents in production. Anthropic's latest research, which analyzes millions of human-agent interactions, is one of the first serious attempts to fill this gap. The findings challenge some assumptions while confirming others.

What the Data Actually Shows

Software engineering accounts for nearly 50% of all agentic tool calls on Anthropic's API. Beyond coding, usage spans business intelligence, customer service, sales, finance, and e-commerce, but no other category accounts for more than a few percentage points of traffic.

This concentration tells us something important: agents are proving themselves in domains where output is verifiable. You can run code and see if it works. You can test an API integration. The feedback loop is tight.

In contrast, domains where verifying an output demands the same expertise as producing it (such as law, medicine, and complex financial analysis) are adopting more slowly. This isn't a failure; it's appropriate caution.

For the safety-conscious, it's reassuring that 73% of agent tool calls appear to involve a human somewhere in the loop, and only 0.8% of actions appear to be irreversible, such as sending an email to a customer. But don't let this breed complacency. Anthropic explicitly notes that frontier agents are already touching security systems, financial transactions, and production deployments. Average usage masks the extremes.

Autonomy is increasing steadily, but not rapidly. In Claude Code, the 99.9th percentile turn duration nearly doubled between October 2025 and January 2026, from under 25 minutes to over 45 minutes. Notably, this increase was smooth across model releases. If autonomy were purely a function of model capability, we'd expect sharp jumps with each new launch. Instead, the steady trend suggests multiple contributing factors: power users building trust, increasingly ambitious tasks, and ongoing product improvements.

The takeaway for builders is that there's a significant deployment overhang. Models are capable of more autonomy than they exercise in practice. Your users will grant more independence as they develop trust.

How Users Actually Supervise Agents

This is where Anthropic's research gets genuinely useful for product design.

New users tend to approve each action before the agent can execute it. Among new Claude Code users, roughly 20% of sessions use full auto-approve. By a user's 750th session, that share rises to over 40%. But here's the counterintuitive finding: experienced users also interrupt more often. The interrupt rate increases from 5% of turns for new users to around 9% for experienced users.

This isn't contradictory. It's a shift in oversight strategy. Experienced users move from "approve everything" to "let it run, intervene when needed." They're actively monitoring, not passively approving. For product builders, this means designing for evolution. Your UX should support approval-based workflows for new users while building trust-based, monitoring-driven workflows for experienced users who want to intervene with precision.
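One way to design for that evolution is to make the default oversight mode a function of accumulated usage rather than a fixed product setting. The sketch below is illustrative only: the mode names, the opt-in flag, and the session thresholds (other than the 750-session figure echoed from Anthropic's data) are assumptions, not anything either company ships.

```python
from enum import Enum, auto

class OversightMode(Enum):
    APPROVE_EACH = auto()   # new users: confirm every action
    BATCH_APPROVE = auto()  # intermediate: confirm groups of related actions
    MONITOR = auto()        # experienced: auto-run, but make interrupting easy

def pick_mode(session_count: int, auto_approve_opt_in: bool) -> OversightMode:
    """Choose a default oversight mode from usage history.

    Thresholds here are illustrative; the only number grounded in the
    research is that auto-approve adoption climbs past 40% by ~750 sessions.
    """
    if not auto_approve_opt_in or session_count < 50:
        return OversightMode.APPROVE_EACH
    if session_count < 750:
        return OversightMode.BATCH_APPROVE
    return OversightMode.MONITOR
```

The key design choice is that the user's explicit opt-out always wins: even a heavy user who prefers per-action approval stays in `APPROVE_EACH`.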

On complex tasks, Claude Code asks for clarification more than twice as often as on simpler tasks. More importantly, Claude-initiated stops increase faster than human-initiated stops as task complexity rises. This suggests that training models to recognize their own uncertainty is an important safety feature, not just a nice-to-have. The agent that knows when to ask is more trustworthy than the agent that plows ahead. If you're building agents, invest in uncertainty calibration. An agent that surfaces issues proactively also complements external safeguards, such as permission systems and human oversight.
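At the product layer, "knows when to ask" can be as simple as a gate in front of each proposed action. This is a minimal sketch, assuming the agent attaches a self-reported confidence score and a reversibility flag to each action; those fields, the `ProposedAction` type, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str
    confidence: float   # model's self-reported certainty, 0..1 (assumed)
    reversible: bool    # e.g. editing a local file vs. emailing a customer

def next_step(action: ProposedAction, threshold: float = 0.8) -> tuple[str, str]:
    """Route a proposed action: execute it, or stop and ask the human.

    Low confidence OR irreversibility triggers a clarification request,
    so risky actions never ride through on high confidence alone.
    """
    if action.confidence < threshold or not action.reversible:
        return ("ask", f"Before I {action.description}, please confirm.")
    return ("execute", action.description)
```

Note the `or`: in this sketch an irreversible action always pauses for the human, which matches the spirit of treating the 0.8% of irreversible calls with extra care.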

The LlamaIndex Perspective: Engineering Is Changing

While Anthropic focuses on how agents behave, LlamaIndex's Jerry Liu is thinking about what agents mean for engineering organizations. His memo to LlamaIndex's engineering team is blunt: "Coding agents are fundamentally changing software engineering in terms of velocity, role, and org structure."

The shift he describes is dramatic. Prioritization, engineering planning, and implementation tasks used to be divided between EMs, PMs, senior ICs, and junior ICs. Now ICs are expected to handle all product prioritization, product specification, and implementation. Two trends are driving this shift. First, coding agents have cut implementation costs to almost zero and narrowed the role of engineers to writing prompts. Second, LLMs and sub-agents have reduced much of the PM workload associated with synthesizing feedback. Now, the main job of any engineer is to be an end-to-end product owner, translating requirements into specifications and delegating tasks to various sub-agents for implementation.

LlamaIndex explicitly tells engineers to offload as much as possible to their favorite tools, such as Claude Code, Cursor, Devin, Codex, and ChatGPT. The company celebrates and shares learnings about "burning tokens," as long as it drives additional productivity. This is a cultural shift as much as a technical one. Organizations that treat AI tools as optional supplements will fall behind those that treat them as core infrastructure.

Long-Horizon Agents: The Next Frontier

LlamaIndex's broader vision extends beyond coding. It predicts that 2026 will be the year agents evolve from workflows into something more like employees: continuously monitoring events, collaborating with humans and other agents, and doing work independently.

The key architectural insight is that agents need triggers beyond chat. They must retrigger when the world changes, such as when a deadline approaches or someone comments on or edits a document. Not everything should interrupt a human, so agents need a persistent task backlog that batches, escalates, or holds tasks for approval. The UX should look more like an inbox than a chat box.
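The trigger-plus-backlog idea can be sketched as a priority queue that external events feed into, where work is split between what runs automatically and what waits in the human's "inbox" for approval. Everything here is an assumption for illustration (the class name, the event shape, the approval flag); it is not LlamaIndex's implementation.

```python
import heapq
import itertools

class TaskBacklog:
    """A persistent-style task backlog for an event-triggered agent.

    External events (a deadline approaching, a document comment) enqueue
    tasks with a priority; draining the backlog separates auto-runnable
    work from work held for human approval, inbox-style.
    """
    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO within a priority

    def on_event(self, event: str, priority: int, needs_approval: bool) -> None:
        heapq.heappush(
            self._heap,
            (priority, next(self._order), {"event": event, "needs_approval": needs_approval}),
        )

    def drain(self) -> tuple[list[str], list[str]]:
        """Return (auto_run, held_for_approval), each in priority order."""
        auto, held = [], []
        while self._heap:
            _, _, task = heapq.heappop(self._heap)
            (held if task["needs_approval"] else auto).append(task["event"])
        return auto, held
```

A real system would persist the queue and add escalation timers, but even this skeleton shows why the UX ends up looking like an inbox: the held list is exactly what the human sees.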

This vision aligns with Anthropic's data showing that experienced users want to monitor and intervene selectively, not approve every action.

What Builders Should Focus On

Taken together, the two perspectives suggest a few priorities:

  • Post-deployment monitoring matters more than pre-deployment evaluation: Pre-deployment tests show what agents are capable of in controlled settings. But many critical patterns can only be observed in production, such as how users develop trust, when they intervene, and what triggers failures. Anthropic explicitly recommends building infrastructure to collect this data. If you're deploying agents at scale, you need visibility into real-world behavior.
  • Design for evolving oversight: Your users will change how they supervise agents over time. Build UX that supports action-by-action approval for new users, batch approval for intermediate users, and monitor-and-intervene practices for experienced users. Instead of mandating specific interaction patterns, focus on whether humans can effectively monitor and intervene when it matters.
  • Train for uncertainty recognition: Agents that recognize their own uncertainty and surface issues proactively are safer than agents that require humans to catch every problem. This is a training objective, not just a product feature.
  • Start with verifiable domains: Software engineering dominates agent usage because outputs are testable. If you're building agents for other domains, think hard about verification. How will users know if the agent got it right?
  • Embrace the organizational shift: If LlamaIndex's experience is representative, the distinction between engineer and product owner will continue to blur. Teams that adapt their structures by flattening hierarchies and expecting broader ownership will capture more value from agents.
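The monitoring priority above is concrete enough to sketch: if you log per-session supervision signals, the metrics Anthropic's research tracks (auto-approve share, interrupt rate) fall out of a simple aggregation. The session schema here is invented for illustration.

```python
def oversight_metrics(sessions: list[dict]) -> dict:
    """Aggregate supervision signals from logged agent sessions.

    Each session dict uses an assumed schema:
      auto_approve (bool), turns (int), interrupts (int).
    Returns the two trust signals highlighted in Anthropic's research:
    how often users run fully auto-approved, and how often they interrupt.
    """
    total_sessions = len(sessions)
    total_turns = sum(s["turns"] for s in sessions)
    return {
        "auto_approve_share": sum(s["auto_approve"] for s in sessions) / total_sessions,
        "interrupt_rate": sum(s["interrupts"] for s in sessions) / total_turns,
    }
```

Tracking these two numbers over a user's lifetime is one cheap way to see the trust curve described earlier emerging in your own product.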

What's Coming

Both Anthropic and LlamaIndex see the current moment as just the beginning. Software engineering is the initial point of entry, but expansion into higher-stakes domains is inevitable.

The critical question isn't whether agents will become more autonomous. They will. The real issue is whether we'll build the monitoring, oversight, and organizational structures to deploy them responsibly.

The data suggests we're learning. Users are developing calibrated trust, and models are learning when to stop and ask. The infrastructure for post-deployment monitoring is growing.

For builders, the message is clear: the opportunity is enormous, but so is the responsibility. Build agents that know their limits, design products that support evolving oversight, and invest in understanding how your agents actually behave in the real world.

The agents are here. Now we need to learn how to work with them.

Sources

  • Anthropic: Measuring AI agent autonomy in practice (Feb. 2026)
  • LlamaIndex: Long Horizon Document Agents (Feb. 2026)
  • Jerry Liu LinkedIn memo on coding agents (Feb. 2026)
