A billing-recovery agent built on LangGraph was about to re-charge a customer's card for the fourth time on a single disputed transaction when the kill switch caught it. The agent had been running unattended for three hours.
The payment-processor event semantics were subtle enough — refund, partial refund, dispute, disputed-after-refund — that the agent kept picking the wrong retry decision. The kill switch was the only thing that prevented a real loss. That's the texture of agentic work in early 2026: capable enough to look like it works, brittle enough that running it unattended is a real risk. The pitch from eighteen months ago was correct in principle and wildly premature on timeline.
This piece is a field report on the four agentic stacks worth taking seriously right now (LangGraph, OpenAI Assistants v2, Anthropic's computer use, and Microsoft AutoGen), measured against three real tasks built and shipped over the past six months. None of these stacks produced a production agent worth running unattended. Two produced agents that work with significant guardrails. The other two taught useful lessons about where the wall currently is.
I went into the eighteen-month window expecting at least one of the four frameworks to be solidly production-ready by now. None of them are, for high-stakes autonomous action. That's not a failure of the frameworks — it's a failure of the underlying capability. The agent loop's reliability is a frontier-model property, not a framework property, and the frontier models aren't there yet.
What the tasks were
Three tasks, each picked because the definition of done is clear and the task isn't trivially solvable by a single LLM call.
Task A: an inbox-triage agent that reads incoming support emails, classifies the issue, looks up the customer's account in a small database, drafts a response, and either sends it directly for low-stakes issues or queues it for human review for higher-stakes ones. Three tools: read database, classify issue, send/queue email.
Task B: a crash-report triage agent that watches an error stream, identifies new clusters of failures, looks up the relevant code paths in a Git repository, and writes a triage note for the engineering team to review the next morning. Four tools: read crash endpoint, search Git, read file from repo, write triage doc.
Task C: a billing-recovery agent that watches a payment processor for failed charges, classifies the failure reason, and either retries the charge after a delay or sends a templated email asking the customer to update their card. Three tools: read payment events, retry charge, send email.
Each is a small business workflow. Each is the kind of thing the agent demos suggest should be trivial. None of them turned out to be trivial.
Worth flagging: this evaluation is based on production deployments at the team scale (single-digit engineers maintaining the agent). Enterprise multi-team agent deployments have different failure modes I haven't tested. The vendor claims about reliability at scale are exactly the kind of thing this piece is skeptical of, but the counter-evidence isn't something I can produce from my own work.
LangGraph
LangGraph is the agent framework with the most working hours behind it in this evaluation. Its model is a graph of nodes — each node either an LLM call, a tool call, or a deterministic Python function — with explicit transitions between them. The framework gives you control over the topology, which is what you need once you've decided that the LLM-makes-all-decisions style isn't reliable enough.
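To make "explicit transitions" concrete, here is a minimal sketch in plain Python. It is not LangGraph's API; it only illustrates the idea that the topology lives in a table the model cannot override. The node names, the `classify` stub standing in for an LLM call, and the state keys are all hypothetical.

```python
from typing import Callable, Dict

State = Dict[str, object]


def classify(state: State) -> State:
    # Stand-in for an LLM call; a real node would call a model here.
    text = str(state["email"]).lower()
    state["label"] = "billing" if "refund" in text else "general"
    return state


def draft(state: State) -> State:
    # Deterministic template step (hypothetical).
    state["draft"] = f"Re: your {state['label']} question"
    return state


def route(state: State) -> State:
    # Deterministic routing rule: billing issues always go to a human.
    state["queue"] = "human_review" if state["label"] == "billing" else "auto_send"
    return state


NODES: Dict[str, Callable[[State], State]] = {
    "classify": classify,
    "draft": draft,
    "route": route,
}

# The topology is data, not something the model decides at run time.
EDGES: Dict[str, str] = {"classify": "draft", "draft": "route", "route": "END"}


def run_graph(state: State, start: str = "classify", max_steps: int = 10) -> State:
    node = start
    for _ in range(max_steps):
        if node == "END":
            return state
        state = NODES[node](state)
        node = EDGES[node]
    raise RuntimeError("graph did not terminate within max_steps")
```

Calling `run_graph({"email": "I want a refund"})` walks classify, draft, route in that order every time. The model only influences what happens inside a node, never which node runs next.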
Task A worked. Two long weekends of work produced an agent that, on a held-out set of 200 emails, classified the issue correctly, drafted a response good enough to send, and routed it to the right queue 88% of the time. The 12% failure rate concentrated on weird inputs: multiple unrelated questions in one email, attachments referenced by name without being attached, replies-to-replies treated as new threads.
Task B partially worked. The agent could identify error clusters and find the relevant code. The triage notes it wrote were inconsistent in quality: sometimes excellent, sometimes a confused summary of three unrelated bugs. After a week in production it was turned off, because reading the triage notes was taking as long as triaging the bugs by hand would have.
Task C failed. The payment-processor event semantics are subtle enough that the agent kept making the wrong retry decision. After a session where it almost re-charged a customer four times for a single disputed transaction, the agent was shut down and the workflow returned to manual review.
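One guardrail that would have contained this failure is a deterministic check that sits between the model's decision and the retry tool. The sketch below is an assumption about how such a guard could look, not the setup used in this evaluation. The event names and the one-retry cap are hypothetical, and a real processor's vocabulary will differ.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical event types, loosely modeled on the cases named in this piece.
BLOCKING_EVENTS = {"refund", "partial_refund", "dispute", "disputed_after_refund"}

# Assumed policy: at most one automated retry per transaction.
MAX_RETRIES_PER_TRANSACTION = 1


class RetryGuard:
    """Deterministic gate in front of the retry-charge tool."""

    def __init__(self) -> None:
        self._retries: Dict[str, int] = defaultdict(int)

    def allow_retry(self, transaction_id: str, history: List[str]) -> bool:
        # Never retry a transaction that has any refund or dispute event.
        if BLOCKING_EVENTS.intersection(history):
            return False
        # Enforce the per-transaction cap regardless of what the model asks for.
        if self._retries[transaction_id] >= MAX_RETRIES_PER_TRANSACTION:
            return False
        self._retries[transaction_id] += 1
        return True
```

The guard does not make the agent smarter about event semantics. It only bounds the damage: a disputed transaction can never be retried, and no transaction can be retried more often than the cap allows.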
OpenAI Assistants v2
The current major iteration of OpenAI's Assistants API, released September 2024 and still the production line in early 2026. The model is higher-level than LangGraph. Describe tools and instructions, let the platform handle the loop. It's the closest thing in the space to "just describe what you want."
The trade-off is visibility. When the agent does the wrong thing, debugging means reading the logs of which tools were called in which order. OpenAI exposes this but doesn't make it pleasant to navigate. For a working developer, that friction matters more than the development speed gain.
Task A built on Assistants v2 hit comparable accuracy to the LangGraph version in about a third of the development time. When the agent failed in production, though, the cause was harder to find, and that slowed every iteration after the first. For low-stakes workflows that's acceptable. For anything where a wrong action costs money or trust, the LangGraph-style explicit control is the right fit.
Anthropic's computer use
A different kind of capability, introduced by Anthropic in October 2024. Instead of calling APIs, the agent sees a virtual screen, moves the mouse, types, and reads the result. This opens up tasks with no API surface: anything in a desktop GUI, anything that requires logging into a website without a clean programmatic interface.
Task A was tested on this stack in a deliberately constrained variant: read emails from a webmail interface, classify them visually, write to a spreadsheet. The point was to see whether computer use is a faster path to an end-to-end agent than the API-call approach.
It worked. Slowly. The agent successfully read 87 of 100 test emails, classified 78 correctly, and wrote 73 to the sheet. The end-to-end success rate was lower than the API-based version, and the latency was far higher: about 40 seconds per email versus 3 seconds. The interesting failure mode: computer use breaks when the UI changes, which it does constantly. The webmail compose interface shifted twice during the testing window, and the agent broke each time until it was re-tuned.
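The numbers above read more clearly as a funnel. This is a quick worked calculation, assuming each stage only counts emails that survived the previous one (78 classified out of the 87 read, 73 written out of the 78 classified). That reading is my assumption about how the counts nest.

```python
emails = 100
read_ok, classified_ok, written_ok = 87, 78, 73

# Per-stage conversion, assuming each stage operates on the previous stage's survivors.
read_rate = read_ok / emails
classify_rate = classified_ok / read_ok
write_rate = written_ok / classified_ok

# End-to-end success is the fraction of all test emails that reached the sheet.
end_to_end = written_ok / emails

# Latency penalty of the screen-driven path versus the API path.
latency_ratio = 40 / 3

print(f"read {read_rate:.1%}, classify {classify_rate:.1%}, write {write_rate:.1%}")
print(f"end to end {end_to_end:.0%}, latency {latency_ratio:.1f}x")
```

Each stage loses somewhere between 6% and 13% of what it receives, and the losses compound to 73% end to end, at roughly 13 times the per-email latency.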
For long-tail tasks with no API that don't change often, computer use is the right tool. For anything run thousands of times against a UI that may change, the API path is structurally more durable.
Microsoft AutoGen
Microsoft AutoGen's pitch is multi-agent. Instead of one agent doing everything, compose a team of specialist agents that collaborate. Task B was given the longest leash on this stack: a triage team with a crash-reader agent, a code-searcher agent, and a writer agent.
The result was the most interesting failure in the test. The agents talked to each other constantly. The conversation was internally coherent. The output was worse than the single-agent LangGraph version, because the multi-agent setup introduced new failure modes. Agents disagreeing about classification. Agents handing off incomplete context. Agents getting into mutual loops where each waited for the other to act.
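The mutual-waiting failure is at least detectable from outside the conversation. Below is a minimal sketch of a round-robin coordinator with a turn cap and a stall check. It is plain Python, not AutoGen's API, and the `Agent` signature (a function that returns a message or `None` when it is waiting) and the `"DONE"` sentinel are hypothetical.

```python
from typing import Callable, List, Optional

# Hypothetical agent interface: given the transcript, return a message or None to wait.
Agent = Callable[[List[str]], Optional[str]]


def run_team(agents: List[Agent], max_turns: int = 12) -> List[str]:
    transcript: List[str] = []
    idle_streak = 0
    for turn in range(max_turns):
        agent = agents[turn % len(agents)]
        message = agent(transcript)
        if message is None:
            idle_streak += 1
            # Every agent has passed in a row, so nobody is going to act.
            if idle_streak >= len(agents):
                raise RuntimeError("deadlock: all agents are waiting on each other")
            continue
        idle_streak = 0
        transcript.append(message)
        if message == "DONE":
            return transcript
    raise RuntimeError("turn cap reached without completion")
```

This does nothing for the subtler failures, such as disagreement about classification or incomplete handoffs. It only turns a silent hang into a loud error.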
Call it the Frankenstein problem. Adding agents doesn't add intelligence. It adds opportunities for the agents to confuse each other in ways a single agent wouldn't. There may be tasks where the multi-agent decomposition wins. None turned up in this evaluation.
Every agent demo looks magical. Every agent in production looks like duct tape. The gap between the two is where the next two years of work need to happen.
One genuine uncertainty: whether OpenAI's upcoming Assistants v3 (announced at their December dev day, not yet generally available) closes any of the gaps named below. The preview demos look strong on debugging visibility. The preview demos always look strong. Worth re-testing when v3 ships.
Where agents actually work in 2026
Three categories where an agent in production is worth trusting today.
First: high-volume, low-stakes classification or routing. What Task A turned out to be. When the cost of being wrong is small and the volume is large, an 88% success rate is a real productivity gain. The wrong answers get caught downstream by humans, retries, or simple sanity checks. The agent pays for itself on the easy cases.
Second: tightly scoped tool calling. Single tool, single decision, clear stopping condition. Search the docs and return the relevant section. Look up the customer record. Fetch the weather and return it as JSON. These are agents in the most generous sense. They're closer to LLMs with a single function call. They work because there's no loop to fall out of.
Third: human-in-the-loop assistance. The agent does the legwork. A human approves the action. That's the model behind every coding assistant that ships real code, and it works because the human catches the failures the agent would otherwise commit.
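The human-in-the-loop pattern is small enough to sketch. This is a hedged illustration, assuming the agent emits a proposed action instead of executing it. `ProposedAction`, `execute_with_approval`, and the callback signatures are hypothetical names, not any framework's API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProposedAction:
    tool: str        # which tool the agent wants to call
    args: dict       # arguments it wants to pass
    rationale: str   # the agent's explanation, shown to the reviewer


def execute_with_approval(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    execute: Callable[[ProposedAction], None],
    audit_log: List[str],
) -> bool:
    # The side effect only happens after an explicit human decision.
    approved = approve(action)
    # Record every decision, including rejections, for later review.
    audit_log.append(f"{action.tool} approved={approved}")
    if approved:
        execute(action)
    return approved
```

In practice `approve` would be a review queue or a UI prompt. The point of the shape is that the agent cannot reach `execute` on its own.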
The agent loop, in four steps:
1. Read the world (API, database, screenshot).
2. LLM picks the next action from a tool list.
3. Side effects happen here. This is where money disappears.
4. Goal met → done. Else → back to step 1. Cap the loop.
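The four steps above (read, pick an action, act, check) fit in a dozen lines. This is a minimal sketch under stated assumptions: `observe`, `decide`, and `goal_met` are hypothetical callables, with `decide` standing in for the LLM call, and the cap of six steps is an arbitrary choice.

```python
from typing import Callable, Dict, List, Tuple


def run_agent(
    observe: Callable[[], object],
    decide: Callable[[object, List[str]], Tuple[str, dict]],
    tools: Dict[str, Callable[..., object]],
    goal_met: Callable[[], bool],
    max_steps: int = 6,
) -> int:
    """Run the loop and return how many steps it took to meet the goal."""
    for step in range(1, max_steps + 1):
        observation = observe()                               # step 1: read the world
        tool_name, args = decide(observation, list(tools))    # step 2: pick an action
        if tool_name not in tools:
            raise ValueError(f"model picked unknown tool: {tool_name}")
        tools[tool_name](**args)                              # step 3: side effects happen here
        if goal_met():                                        # step 4: check, else loop again
            return step
    raise RuntimeError("loop cap reached before the goal was met")
```

The cap is the cheapest guardrail in the whole design. An agent that hits it has failed, and the failure is visible instead of expensive.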
A short timeline:
Mar 2024: LangChain Agents. First broadly-used framework. Tool-calling templates that hid the loop.
Aug 2024: LangGraph. Explicit graph topology. The framework people actually ship.
Sep 2024: OpenAI Assistants v2. Higher-level API. Faster to prototype, harder to debug.
Oct 2024: Anthropic computer use. Agent that sees a screen and uses a mouse + keyboard.
2025: Multi-agent everywhere. AutoGen and friends. Mostly worse than one well-designed agent.
LangGraph: production. Best topology control.
Assistants v2: prototypes. Fastest idea-to-demo.
Computer use: UI automation. Brittle on dynamic UIs.
AutoGen: research. Multi-agent experiments.
Where agents don't work yet
Long-horizon planning. Anything that needs the agent to hold coherence across more than five or six tool calls. The frontier models are getting better at this. They aren't there yet.
High-stakes autonomous action. Anything where a wrong action costs money, trust, or safety. The 88% success rate that's acceptable for email classification isn't acceptable for charging a customer's credit card.
Open-ended exploration. Tasks without a clear stopping condition. The agent will eventually do something useful, and then it'll keep going, and the keeping-going is where the trouble starts.
The 2024 promise of autonomous agents replacing white-collar workflows was wrong on the timeline. In early 2026, agents are a useful tool in a narrow set of bounded, low-stakes workflows, and a science experiment everywhere else. If you're building agents, scope tightly, instrument heavily, and assume the loop will misbehave in production at least once a week.
For a working developer who wants to try this, go with LangGraph and Claude Opus 4.7 as the backing model. The explicit control matters more than the development speed. Build one agent for one task. Get it to 90% reliability before reaching for the second.
For the venture-capital pitch that says agents will replace knowledge workers in three years — the pitch is wrong. The path from current capability to general-purpose autonomous agents is longer than the slides suggest, and it goes through a stretch of work on reliability and recovery that isn't sexy enough to fund easily. The agents that will matter are the ones built carefully on the narrow set of capabilities the models actually have. Not the ones built on the speculative capabilities they're promised to have.
Whether the agent frameworks will consolidate down to two or three winners or stay fragmented across a dozen — I don't know. The signs point both ways depending on which week you check.