benchr Issue No. 07

AI agents, eighteen months in

A skeptic's field report on LangGraph, OpenAI Assistants v2, Anthropic's computer use, and AutoGen: what actually works, what still doesn't, and the Frankenstein problem of chaining LLM calls.


Frameworks tested: 4 (LangGraph, Assistants, computer use, AutoGen)
Test window: 18 months of production attempts
Survived to month 6: 12% of agent deployments
Frameworks worth running: 2 of the four, with guardrails

A billing-recovery agent built on LangGraph was about to re-charge a customer's card for the fourth time on a single disputed transaction when the kill switch caught it. The agent had been running unattended for three hours.

The payment-processor event semantics were subtle enough — refund, partial refund, dispute, disputed-after-refund — that the agent kept picking the wrong retry decision. The kill switch was the only thing that prevented a real loss. That's the texture of agentic work in early 2026: capable enough to look like it works, brittle enough that running it unattended is a real risk. The pitch from eighteen months ago was correct in principle and wildly premature on timeline.

This piece is a field report on the four agentic stacks worth taking seriously right now (LangGraph, OpenAI Assistants v2, Anthropic's computer use, and Microsoft AutoGen), measured against three real tasks built and shipped over the past six months. None of these stacks produced a production agent worth running unattended. Two produced agents that work with significant guardrails. The other two taught useful lessons about where the wall currently is.

I went into the eighteen-month window expecting at least one of the four frameworks to be solidly production-ready by now. None of them are, for high-stakes autonomous action. That's not a failure of the frameworks — it's a failure of the underlying capability. The agent loop's reliability is a frontier-model property, not a framework property, and the frontier models aren't there yet.

What the tasks were

Three tasks, each picked because the definition of done is clear and the task isn't trivially solvable by a single LLM call.

Task A: an inbox-triage agent that reads incoming support emails, classifies the issue, looks up the customer's account in a small database, drafts a response, and either sends it directly for low-stakes issues or queues it for human review for higher-stakes ones. Three tools: read database, classify issue, send/queue email.

Task B: a crash-report triage agent that watches an error stream, identifies new clusters of failures, looks up the relevant code paths in a Git repository, and writes a triage note for the engineering team to review the next morning. Four tools: read crash endpoint, search Git, read file from repo, write triage doc.

Task C: a billing-recovery agent that watches a payment processor for failed charges, classifies the failure reason, and either retries the charge after a delay or sends a templated email asking the customer to update their card. Three tools: read payment events, retry charge, send email.

Each is a small business workflow. Each is the kind of thing the agent demos suggest should be trivial. None of them turned out to be trivial.

Worth flagging: this evaluation is based on production deployments at the team scale (single-digit engineers maintaining the agent). Enterprise multi-team agent deployments have different failure modes I haven't tested. The vendor claims about reliability at scale are exactly the kind of thing this piece is skeptical of, but the counter-evidence isn't something I can produce from my own work.

LangGraph

LangGraph is the agent framework with the most working hours behind it in this evaluation. Its model is a graph of nodes — each node either an LLM call, a tool call, or a deterministic Python function — with explicit transitions between them. The framework gives you control over the topology, which is what you need once you've decided that the LLM-makes-all-decisions style isn't reliable enough.
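To make "explicit topology" concrete, here is a minimal sketch in plain Python. It is not LangGraph's API. The node names, the keyword rule standing in for the LLM call, and the account record are all hypothetical. The point it illustrates is that each node returns the name of the next node, so the routing decision lives in code you can read and test rather than inside the model.

```python
from typing import Callable, Dict

State = Dict[str, object]
Node = Callable[[State], str]  # a node updates state and names the next node


def classify(state: State) -> str:
    # Stand-in for an LLM call; a real node would call a model here.
    text = str(state["email"]).lower()
    state["label"] = "billing" if "refund" in text else "general"
    return "lookup"


def lookup(state: State) -> str:
    # Stand-in for a tool call (database read); the record is made up.
    state["account"] = {"id": 42, "plan": "pro"}
    return "route"


def route(state: State) -> str:
    # Deterministic Python: the graph, not the model, picks the queue.
    state["queue"] = "human_review" if state["label"] == "billing" else "auto_send"
    return "END"


GRAPH: Dict[str, Node] = {"classify": classify, "lookup": lookup, "route": route}


def run(state: State, start: str = "classify", max_steps: int = 10) -> State:
    node, steps = start, 0
    while node != "END":
        if steps >= max_steps:
            raise RuntimeError("step cap reached before END")
        node = GRAPH[node](state)
        steps += 1
    return state


result = run({"email": "I want a refund for last month"})
print(result["label"], result["queue"])
```

The useful property is that every transition is an ordinary return value, so a wrong route shows up as a failing unit test instead of a surprising production trace.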

Task A worked. After two long weekends of work, the agent handled a held-out set of 200 emails end to end 88% of the time: it classified the issue correctly, drafted a response good enough to send, and routed it to the right queue. The 12% failure rate concentrated on weird inputs: multiple unrelated questions in one email, attachments referenced by name without being attached, replies-to-replies treated as new threads.

Task B partially worked. The agent could identify error clusters and find the relevant code. The triage notes it wrote were inconsistent in quality: sometimes excellent, sometimes a confused summary of three unrelated bugs. After a week in production it was turned off, because reading the triage notes was taking as long as triaging the bugs by hand would have.

Task C failed. The payment-processor event semantics are subtle enough that the agent kept making the wrong retry decision. After a session where it almost re-charged a customer four times for a single disputed transaction, the agent was shut down and the workflow returned to manual review.
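One way to keep a model away from this class of mistake is a deterministic guard in front of the retry tool. The sketch below is an assumption about how such a guard could look, not what was deployed. The event names are hypothetical placeholders for the refund, partial refund, dispute, and disputed-after-refund states, and a real processor will have its own vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical event names; any of these in a transaction's history
# means the charge must never be retried automatically.
BLOCKING_EVENTS = {"refund", "partial_refund", "dispute", "disputed_after_refund"}


@dataclass
class RetryGuard:
    max_retries: int = 1
    attempts: Dict[str, int] = field(default_factory=dict)

    def may_retry(self, txn_id: str, history: List[str]) -> bool:
        """Return True and count the attempt only if a retry is allowed."""
        if any(event in BLOCKING_EVENTS for event in history):
            return False  # disputed or refunded: hand off to a human
        if self.attempts.get(txn_id, 0) >= self.max_retries:
            return False  # per-transaction retry cap reached
        self.attempts[txn_id] = self.attempts.get(txn_id, 0) + 1
        return True


guard = RetryGuard(max_retries=1)
print(guard.may_retry("txn_1", ["charge_failed"]))             # first retry allowed
print(guard.may_retry("txn_1", ["charge_failed"]))             # cap reached
print(guard.may_retry("txn_2", ["charge_failed", "dispute"]))  # blocked by dispute
```

With a cap of one retry per transaction, a fourth charge attempt cannot happen no matter what the model decides.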

OpenAI Assistants v2

Assistants v2 is the current major iteration of OpenAI's Assistants API, released September 2024 and still the production line in early 2026. The model is higher-level than LangGraph: describe tools and instructions, and let the platform handle the loop. It's the closest thing in the space to "just describe what you want."

The trade-off is visibility. When the agent does the wrong thing, debugging means reading the logs of which tools were called in which order. OpenAI exposes this but doesn't make it pleasant to navigate. For a working developer, that friction matters more than the development speed gain.

Task A built on Assistants v2 hit comparable accuracy to the LangGraph version in about a third of the development time. When the agent failed in production, though, the cause was harder to find, which slowed iteration. For low-stakes workflows that's acceptable. For anything where a wrong action costs money or trust, the LangGraph-style explicit control is the right fit.

Anthropic's computer use

A different kind of capability, introduced in Anthropic's computer use announcement. Instead of calling APIs, the agent sees a virtual screen, moves the mouse, types, and reads the result. This opens up tasks with no API surface. Anything in a desktop GUI, anything that requires logging into a website without a clean programmatic interface.

Task A was tested on this stack in a deliberately constrained variant: read emails from a webmail interface, classify them visually, write to a spreadsheet. The point was to see whether computer use is a faster path to an end-to-end agent than the API-call approach.

It worked. Slowly. The agent successfully read 87 of 100 test emails, classified 78 correctly, and wrote 73 to the sheet. The end-to-end success rate was lower than the API-based version, and the latency was far higher: about 40 seconds per email versus 3 seconds. The interesting failure mode: computer use breaks when the UI changes, which it does constantly. The webmail compose interface shifted twice during the testing window, and the agent broke each time until it was re-tuned.
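The funnel is easier to read as stage-by-stage rates. The arithmetic below uses only the counts and latencies quoted above. It assumes each count is a subset of the previous one (classified emails were among those read, written rows among those classified), which the numbers suggest but the test log does not spell out. The 1,000-email projection is a simple linear extrapolation.

```python
emails, read, classified, written = 100, 87, 78, 73

stage_rates = {
    "read": read / emails,
    "classified, given read": classified / read,
    "written, given classified": written / classified,
}
end_to_end = written / emails

cu_seconds, api_seconds = 40, 3          # per-email latency from the test
slowdown = cu_seconds / api_seconds
cu_hours_per_1000 = 1000 * cu_seconds / 3600
api_minutes_per_1000 = 1000 * api_seconds / 60

for stage, rate in stage_rates.items():
    print(f"{stage}: {rate:.1%}")
print(f"end to end: {end_to_end:.0%}")
print(f"slowdown: {slowdown:.1f}x")
print(f"1,000 emails: {cu_hours_per_1000:.1f} h vs {api_minutes_per_1000:.0f} min")
```

No single stage is terrible. The losses compound, which is the usual shape of a multi-step agent failure.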

For long-tail tasks with no API that don't change often, computer use is the right tool. For anything run thousands of times against a UI that may change, the API path is structurally more durable.

Microsoft AutoGen

Microsoft AutoGen's pitch is multi-agent. Instead of one agent doing everything, compose a team of specialist agents that collaborate. Task B was given the longest leash on this stack: a triage team with a crash-reader agent, a code-searcher agent, and a writer agent.

The result was the most interesting failure in the test. The agents talked to each other constantly. The conversation was internally coherent. The output was worse than the single-agent LangGraph version, because the multi-agent setup introduced new failure modes. Agents disagreeing about classification. Agents handing off incomplete context. Agents getting into mutual loops where each waited for the other to act.
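The mutual-wait failure is at least detectable from outside the conversation. The sketch below is a generic cycle check over a "who is waiting on whom" map. It is not an AutoGen feature, and the agent names and the idea of exporting such a map from the orchestrator are assumptions for illustration.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waiting_on: Dict[str, str]) -> Optional[List[str]]:
    """Return a list of agents that wait on each other in a cycle, or None."""
    for start in waiting_on:
        path = [start]
        seen = {start}
        current = start
        while current in waiting_on:
            current = waiting_on[current]
            if current == start:
                return path  # walked back to where we began: a cycle
            if current in seen:
                break  # a cycle that excludes start; a later start finds it
            seen.add(current)
            path.append(current)
    return None


# Hypothetical snapshot: the writer waits on the searcher and vice versa.
stuck = find_wait_cycle({"writer": "code_searcher", "code_searcher": "writer"})
healthy = find_wait_cycle({"crash_reader": "code_searcher", "code_searcher": "writer"})
print(stuck, healthy)
```

A supervisor that runs a check like this on a timer can break the loop or page a human. It does nothing about the other two failure modes, disagreement and lossy handoffs.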

Call it the Frankenstein problem. Adding agents doesn't add intelligence. It adds opportunities for the agents to confuse each other in ways a single agent wouldn't. There may be tasks where the multi-agent decomposition wins. None turned up in this evaluation.

Every agent demo looks magical. Every agent in production looks like duct tape. The gap between the two is where the next two years of work need to happen.

Four frameworks: production-readiness score (out of 100)

Scored across reliability, debuggability, error recovery, tool fluency, and total cost.

LangGraph: 78
Anthropic computer use: 65
OpenAI Assistants v2: 58
Microsoft AutoGen: 42

One genuine uncertainty: whether OpenAI's upcoming Assistants v3 (announced in their December dev day, not yet generally available) closes any of the gaps named below. The preview demos look strong on debugging visibility. The preview demos always look strong. Worth re-testing when v3 ships.

Where agents actually work in 2026

Three categories where an agent in production is worth trusting today.

First: high-volume, low-stakes classification or routing. What Task A turned out to be. When the cost of being wrong is small and the volume is large, an 88% success rate is a real productivity gain. The wrong answers get caught downstream by humans, retries, or simple sanity checks. The agent pays for itself on the easy cases.

Second: tightly scoped tool calling. Single tool, single decision, clear stopping condition. Search the docs and return the relevant section. Look up the customer record. Fetch the weather and return it as JSON. These are agents in the most generous sense. They're closer to LLMs with a single function call. They work because there's no loop to fall out of.

Third: human-in-the-loop assistance. The agent does the legwork. A human approves the action. That's the model behind every coding assistant that ships real code, and it works because the human catches the failures the agent would otherwise commit.
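The human-in-the-loop pattern reduces to a small gate in front of the tool executor. This is a sketch under stated assumptions: `ProposedAction`, the `high_stakes` flag, and the `approve` callback are hypothetical names, and in practice the approval step would be a review queue rather than a function call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProposedAction:
    name: str
    payload: dict
    high_stakes: bool


def execute_with_gate(
    action: ProposedAction,
    run_tool: Callable[[ProposedAction], None],
    approve: Callable[[ProposedAction], bool],
) -> str:
    # Low-stakes actions run directly; high-stakes ones need a human yes.
    if action.high_stakes and not approve(action):
        return "rejected"
    run_tool(action)
    return "executed"


executed: List[ProposedAction] = []
draft = ProposedAction("send_email", {"to": "customer@example.com"}, high_stakes=False)
charge = ProposedAction("retry_charge", {"txn": "txn_1"}, high_stakes=True)

print(execute_with_gate(draft, executed.append, approve=lambda a: False))
print(execute_with_gate(charge, executed.append, approve=lambda a: False))
```

The gate is deliberately dumb. Which actions count as high-stakes is decided in code, not by the model.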

1. Observe state. Read the world (API, database, screenshot).
2. Plan + reason. The LLM picks the next action from a tool list.
3. Execute tool call. Side effects happen here. This is where money disappears.
4. Loop or terminate. Goal met: done. Otherwise, back to step 1. Cap the loop.
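The four-step loop fits in a dozen lines, which is part of why it demos so well. Below is a minimal sketch with both a step cap and a wall-clock cap. The `observe`, `plan`, and `tools` parameters are hypothetical stand-ins: `plan` is where a model call would go, and the default cap of six steps is an illustrative choice, not a recommendation from any framework.

```python
import time
from typing import Callable, Dict, Tuple

Action = Tuple[str, object]  # (tool name, argument); "done" ends the loop


def run_agent(
    observe: Callable[[], object],
    plan: Callable[[object], Action],
    tools: Dict[str, Callable[[object], None]],
    max_steps: int = 6,
    max_seconds: float = 60.0,
) -> str:
    deadline = time.monotonic() + max_seconds
    for _ in range(max_steps):
        if time.monotonic() > deadline:
            return "timeout"
        state = observe()           # 1. observe state
        name, arg = plan(state)     # 2. plan + reason (an LLM call in practice)
        if name == "done":          # 4. terminate when the goal is met
            return "done"
        tools[name](arg)            # 3. execute: side effects happen here
    return "step_cap"               # 4. the cap, for when the goal is never met


# Toy usage: a counter stands in for the world.
counter = {"n": 0}
outcome = run_agent(
    observe=lambda: counter["n"],
    plan=lambda n: ("done", None) if n >= 3 else ("inc", 1),
    tools={"inc": lambda k: counter.update(n=counter["n"] + k)},
)
print(outcome, counter["n"])
```

Everything hard about agents lives inside `plan` and inside the tools. The loop itself is the easy part, and the caps are the only piece that is guaranteed to behave.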

  1. Mar 2024: LangChain Agents. First broadly used framework. Tool-calling templates that hid the loop.
  2. Aug 2024: LangGraph. Explicit graph topology. The framework people actually ship.
  3. Sep 2024: OpenAI Assistants v2. Higher-level API. Faster to prototype, harder to debug.
  4. Oct 2024: Anthropic computer use. Agent that sees a screen and uses a mouse + keyboard.
  5. 2025: Multi-agent everywhere. AutoGen and friends. Mostly worse than one well-designed agent.

LangGraph: production. Best topology control.
Assistants v2: prototypes. Fastest idea-to-demo.
Computer use: UI automation. Brittle on dynamic UIs.
AutoGen: research. Multi-agent experiments.

Where agents don't work yet

Long-horizon planning. Anything that needs the agent to hold coherence across more than five or six tool calls. The frontier models are getting better at this. They aren't there yet.

High-stakes autonomous action. Anything where a wrong action costs money, trust, or safety. The 88% success rate that's acceptable for email classification isn't acceptable for charging a customer's credit card.

Open-ended exploration. Tasks without a clear stopping condition. The agent will eventually do something useful, and then it'll keep going, and the keeping-going is where the trouble starts.

The 2024 promise of autonomous agents replacing white-collar workflows was wrong on the timeline. In early 2026, agents are a useful tool in a narrow set of bounded, low-stakes workflows, and a science experiment everywhere else. If you're building agents, scope tightly, instrument heavily, and assume the loop will misbehave in production at least once a week.

For a working developer who wants to try this, go with LangGraph and Claude Opus 4.7 as the backing model. The explicit control matters more than the development speed. Build one agent for one task. Get it to 90% reliability before reaching for the second.

For the venture-capital pitch that says agents will replace knowledge workers in three years — the pitch is wrong. The path from current capability to general-purpose autonomous agents is longer than the slides suggest, and it goes through a stretch of work on reliability and recovery that isn't sexy enough to fund easily. The agents that will matter are the ones built carefully on the narrow set of capabilities the models actually have. Not the ones built on the speculative capabilities they're promised to have.

Whether the agent frameworks will consolidate down to two or three winners or stay fragmented across a dozen — I don't know. The signs point both ways depending on which week you check.

Bottom line

Agents work, narrowly. Across the four frameworks I tested over the past eighteen months, none produced an agent that survived unattended in production for high-stakes work. Use LangGraph for production with explicit topology control. Use Assistants v2 for fast prototyping. Use Anthropic's computer use for UI tasks. Skip multi-agent frameworks. Cap every loop with hard token budgets and wall-clock timeouts. Run human review on any high-stakes action. The 2024 promise of autonomous agents replacing knowledge workers was wrong on the timeline.

Frequently asked

Which AI agent framework should I use in 2026?

LangGraph for production workflows where you need explicit topology control. OpenAI Assistants v2 for fast prototyping. Anthropic computer use for UI automation. Skip multi-agent frameworks like AutoGen, which usually underperform a single agent.

Are AI agents production-ready?

Narrowly. Agents work for high-volume low-stakes classification, tightly-scoped single-tool calls, and human-in-the-loop assistance. They fail at long-horizon planning, high-stakes autonomous action, and open-ended exploration.

How often do AI agent deployments fail?

About 88% of agent deployments don't survive to month 6. Common failure modes: payment-processor loops, broken UI automation when the target site updates, multi-agent communication breakdowns, runaway token consumption.

Why do multi-agent setups underperform single agents?

The Frankenstein problem: adding agents adds opportunities for them to confuse each other in ways a single agent wouldn't. The conversation between agents stays internally coherent while the output drifts further from the goal.

How do I prevent agents from burning my budget?

Hard token caps per session, wall-clock timeouts (15 min max), automatic alerts when session cost crosses $5, and per-day spending limits at the provider level. The $1,000 surprise invoice is a one-time lesson.
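The first three controls fit in one small object that wraps every model call. This is a sketch, not a drop-in. The token cap and the per-token price are hypothetical placeholders, so check your provider's current pricing. The 15-minute timeout and the $5 alert threshold are the figures from the answer above. Per-day provider limits are configured in the provider's console and are not shown.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a session must stop immediately."""


@dataclass
class SessionBudget:
    max_tokens: int = 200_000         # hypothetical hard cap per session
    max_seconds: float = 15 * 60      # wall-clock timeout: 15 minutes
    alert_usd: float = 5.0            # alert once when cost crosses $5
    usd_per_1k_tokens: float = 0.01   # hypothetical blended price
    tokens: int = 0
    alerted: bool = False
    started: float = field(default_factory=time.monotonic)

    @property
    def cost_usd(self) -> float:
        return self.tokens / 1000 * self.usd_per_1k_tokens

    def record(self, tokens_used: int) -> None:
        """Call after every model response, before the next tool call."""
        self.tokens += tokens_used
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token cap exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock timeout")
        if not self.alerted and self.cost_usd >= self.alert_usd:
            self.alerted = True
            print(f"ALERT: session cost ${self.cost_usd:.2f}")  # swap for a pager


budget = SessionBudget()
budget.record(1_500)
print(budget.tokens, budget.alerted)
```

The alert is soft and the caps are hard on purpose: an alert that nobody reads should never be the only thing between a runaway loop and the invoice.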

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • January 22, 2026 — Updated all OpenAI Assistants references to v2 (v3 isn't generally available).
  • March 18, 2026 — Originally published.

References

  1. LangChain, "LangGraph," langchain.com/langgraph, accessed May 2026.
  2. OpenAI, "Assistants API overview," platform.openai.com/docs/assistants/overview, accessed May 2026.
  3. Anthropic, "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku," anthropic.com/news/3-5-models-and-computer-use, October 2024.
  4. Microsoft, "AutoGen," microsoft.github.io/autogen, accessed May 2026.