December 12, 2025: four coding assistants on the same workstation, same project, same set of tasks. Cursor with its built-in agent. GitHub Copilot in VS Code. Windsurf as a standalone editor. Sourcegraph Cody as a VS Code extension. The plan was simple. Give each one the same feature to build in a real working codebase. Watch what each produced. Take notes.
Honestly, I expected Cursor to lose this one. It's the most expensive of the four, and on paper, the others have similar agent loops. I was wrong. How tightly it integrates with the codebase made the difference.
The codebase was an open-source static-site generator with around 80,000 lines, written in a mainstream language with a typical layered architecture. The feature was already on the project roadmap, so the best result could actually ship. The assistants didn't know they were being compared.
The feature: a Markdown exporter module. The user clicks a button. The static-site generator walks the existing content tree, produces one Markdown file per directory level, preserves YAML frontmatter, references images rather than embedding them, and writes output to a dist/markdown/ directory. The work touches roughly seven files: a new MarkdownExporter class, a writer service, a frontmatter handler, a Markdown formatter, a content-tree walker, a manifest reader, and a registration update in the DI container.
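To make the shape of the feature concrete, here is a minimal Python sketch of the per-directory export. The project's language and manifest format aren't stated here, so everything in it is an assumption for illustration: the nested-dict tree, the key names (name, frontmatter, pages, children, images), the index.md filename, and the helper names. The frontmatter serializer only handles flat key/value pairs.

```python
from pathlib import Path


def render_frontmatter(meta: dict) -> str:
    """Serialize flat key/value metadata as a YAML frontmatter block."""
    lines = [f"{key}: {value}" for key, value in meta.items()]
    return "\n".join(["---", *lines, "---"])


def render_page(page: dict) -> str:
    """Render one page as a section; images are referenced by path, never embedded."""
    parts = [f"## {page['title']}", page.get("body", "")]
    parts += [f"![{img['alt']}]({img['path']})" for img in page.get("images", [])]
    return "\n\n".join(parts)


def export_tree(node: dict, out_dir: Path) -> list:
    """Walk the tree depth-first, writing one index.md per directory level."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sections = [render_frontmatter(node.get("frontmatter", {}))]
    sections += [render_page(page) for page in node.get("pages", [])]
    target = out_dir / "index.md"
    target.write_text("\n\n".join(sections) + "\n", encoding="utf-8")
    written = [target]
    # Recurse into child directories, mirroring the tree under out_dir.
    for child in node.get("children", []):
        written += export_tree(child, out_dir / child["name"])
    return written
```

Calling export_tree(tree, Path("dist/markdown")) on a tree with one child directory would write two files, one per level. The real module splits this work across a walker, a formatter, a frontmatter handler, and a writer service; the sketch collapses them to show the data flow only.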
This is a normal day's work for a competent solo developer. It's also a feature with enough surface area for a coding assistant to mess up in interesting ways. Which all four of them did. The interesting differences aren't in the headline pass-fail outcomes — all four eventually produced something usable. They're in the texture of the code each left behind, which is what you have to live with.
The prompt and the ground rules
Each tool received the same initial prompt, adapted to its interface:
Add a new MarkdownExporter module to the static-site generator.
Look at how the existing HtmlExporter and PdfExporter are structured —
the new module should follow the same pattern.
Requirements:
- Read content tree from a JSON manifest
- Walk the tree producing a single Markdown file per directory level
- Preserve frontmatter as YAML at the top of each file
- Reference images rather than embedding them
- Write output to a dist/markdown/ directory
- Register in the DI container
Match the existing code style. Use async/await throughout.
Use the existing ILogger abstraction. Do not introduce new dependencies.
Each tool had access to the full codebase. Each was given as much time as it wanted before the attempt was declared complete. Each attempt ran twice on consecutive days, and the better of the two runs is reported below. The charitable comparison.
(One note worth flagging before the tool-by-tool walkthrough: at one point during the Cursor testing, the agent loop got into a recursive call pattern and burned through about $3 of API credits in four minutes before I caught it. None of the other tools did this. The pattern was: agent asks itself a clarifying question, agent answers, agent decides the answer needs another clarification, repeat. Worth knowing if you're going to leave an agent running unsupervised.)
Cursor (Claude Opus 4.7 backend)
Cursor ran in agent mode with Claude Opus 4.7 selected as the model. This is my default coding setup outside this comparison, which introduces a small bias worth acknowledging. The other assistants were given the same fair treatment in their own evaluations.
Cursor's agent spent about 90 seconds reading the codebase, then produced a plan in the chat panel. A numbered file list with one sentence per file describing what it intended to create or edit. It asked for confirmation. Confirmation was given.
The implementation took roughly seven minutes of agent activity. It created six of the seven expected files, plus an unrequested seventh (a small MarkdownLinter class that catches malformed frontmatter, not asked for but useful). The code compiled on the first run with no errors.
Two bugs were present. The first: the writer service wrote to dist/markdown/ without checking whether the directory existed. On a fresh checkout, the first export would throw. Easy to spot, easy to fix.
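The fix for the first bug is usually a one-liner. As a hedged sketch in Python (the project's actual language and writer API aren't given here, and write_export is a hypothetical name), the writer creates the output directory before the first write:

```python
from pathlib import Path


def write_export(out_dir: Path, filename: str, content: str) -> Path:
    """Write one exported file, creating the output directory on first use."""
    # parents=True also creates dist/ on a fresh checkout;
    # exist_ok=True makes the call a no-op on every later export.
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / filename
    target.write_text(content, encoding="utf-8")
    return target
```

Without the mkdir call, the first write on a fresh checkout fails because dist/markdown/ doesn't exist yet, which matches the failure described above.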
The second was more interesting. Cursor wired up the dependency-injection registration correctly but registered the exporter as a transient dependency, where the rest of the application registers stateful services as singletons. The in-memory frontmatter cache got rebuilt on every directory walk, defeating the cache's purpose. That's the kind of bug that would have shipped if you weren't paying close attention.
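The lifetime bug is easy to reproduce in miniature. The sketch below uses a toy container written for this illustration; it is not the project's DI container, and the Container and register names are hypothetical. It shows why a transient registration throws away the cache while a singleton keeps it.

```python
class Container:
    """Toy DI container (hypothetical) supporting transient and singleton lifetimes."""

    def __init__(self):
        self._factories = {}
        self._instances = {}

    def register(self, key, factory, singleton=False):
        self._factories[key] = (factory, singleton)

    def resolve(self, key):
        factory, singleton = self._factories[key]
        if not singleton:
            return factory()  # transient: a fresh object, and a fresh cache, every time
        if key not in self._instances:
            self._instances[key] = factory()  # singleton: built once, then reused
        return self._instances[key]


class MarkdownExporter:
    def __init__(self):
        self.frontmatter_cache = {}  # in-memory cache that should survive across walks


container = Container()
container.register("transient_exporter", MarkdownExporter)
container.register("singleton_exporter", MarkdownExporter, singleton=True)

# Populate the cache on one resolve, then look for the entry on the next resolve.
container.resolve("transient_exporter").frontmatter_cache["a.md"] = {"title": "A"}
transient_hit = "a.md" in container.resolve("transient_exporter").frontmatter_cache

container.resolve("singleton_exporter").frontmatter_cache["a.md"] = {"title": "A"}
singleton_hit = "a.md" in container.resolve("singleton_exporter").frontmatter_cache
```

Here transient_hit comes out False and singleton_hit comes out True. Nothing crashes in the transient case, which is exactly why this class of bug tends to slip through review.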
The output Markdown files followed the existing pattern faithfully. Naming, frontmatter ordering, image-reference syntax — all consistent with the rest of the codebase. The async/await usage was correct, with proper cancellation-token threading. The frontmatter contract used a sensible shape worth maintaining for years. Two bugs. About sixty minutes of cleanup. The feature was shippable within an hour of Cursor finishing.
GitHub Copilot in VS Code
I went into Copilot testing expecting it to be slightly behind Cursor. It was further behind than that. The agent mode produced more compile errors per task than any other tool tested, and the back-and-forth interaction pattern meant I was in the loop the entire time rather than just at the review stage at the end.
Copilot ran in its agent mode inside VS Code. The integration is less polished than Cursor's. The agent's working state is harder to inspect at a glance, and the interaction is more keystroke-driven.
Copilot's first attempt produced two of the seven files. It created the exporter class and the writer service, then stopped and asked what to do about the missing DI registration. After four prompts and roughly twenty minutes of back-and-forth, all seven files were present.
The code compiled with three errors. Two were trivial — missing using statements that Copilot forgot to add. The third was that the writer service called a method on a YAML helper class that didn't exist. Copilot had invented a method that fit the kind of thing it expected and produced code that referenced it without checking. That's a painful bug to debug because the code looks correct at a glance.
The structural quality was okay but not impressive. The writer service used a pattern slightly different from the rest of the codebase. It raised PropertyChanged events manually in places where the rest of the project uses a source generator. The content-tree walker was correct but visually distinct from the other walkers in subtle ways (different recursion guards, different ordering for sibling directories) that needed manual cleanup. Three compile errors, one of them confidently wrong. Pattern divergence in the writer service added another twenty minutes. About two hours of total work to ship.
Windsurf (Cascade agent)
Windsurf is the editor from the company formerly known as Codeium, now with a strong agent mode that competes directly with Cursor. The Cascade agent was tested in its default configuration, which uses a frontier model under the hood. The docs were vague about which one, but the behavior is consistent with Claude Sonnet 4.7.
The agent's planning step was less explicit than Cursor's. It didn't show the planned file list before starting. The plan had to be reconstructed from the chat history after the fact to understand what had been decided.
The output was strong. All seven files, plus a small documentation update in the project README that hadn't been asked for but did no harm. The code compiled cleanly. The first runtime test succeeded. Clicking the export button produced a dist/markdown/ directory full of correctly formatted files.
Two issues surfaced on closer review. The first was a missing null check in the manifest reader, which would crash on empty content trees. The second was more philosophical: Windsurf chose to add a new dependency (YamlDotNet) for frontmatter parsing, despite the rest of the codebase using a smaller hand-rolled YAML parser. The new dependency would compile and run fine. It was just inconsistent with the existing code. One bug, one stylistic divergence. About forty minutes of cleanup. Of the four assistants, this was the closest in quality to Cursor.
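For the null-check bug, the guard is small. This is a sketch under assumptions: the manifest is JSON (per the prompt), but the top-level "tree" key, the pages/children shape, and the read_content_tree name are hypothetical stand-ins for whatever the real manifest reader uses.

```python
import json


def empty_tree() -> dict:
    """Return a fresh empty tree so callers never share mutable lists."""
    return {"pages": [], "children": []}


def read_content_tree(raw_manifest: str) -> dict:
    """Parse a JSON manifest, normalizing empty or null content to an empty tree."""
    data = json.loads(raw_manifest) if raw_manifest.strip() else None
    if not isinstance(data, dict):
        return empty_tree()  # blank file or a bare JSON null
    tree = data.get("tree")
    if not tree:
        return empty_tree()  # key missing, null, or {}
    # Fill in optional keys so the walker can iterate without further checks.
    tree.setdefault("pages", [])
    tree.setdefault("children", [])
    return tree
```

Normalizing at the reader means the walker downstream never sees a null, which keeps the check in one place instead of scattered across every consumer.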
Sourcegraph Cody in VS Code
Cody has been pivoting toward a more agent-heavy product over the past six months. It was tested as a VS Code extension with Claude Opus 4.7 configured as the backend, the same model Cursor was running.
Cody's strength is its codebase indexing. Given the prompt, it correctly identified the existing exporter files, the DI container code, the manifest reader, the content tree walker, and the relevant writer patterns. The chat summary that walked through what it had found was the best of the four tools. Really useful as a pre-implementation document, even before any code was written.
The code generation was where things got rougher. Cody produced files one at a time, asking for confirmation between each. That's fine for a careful workflow but slow for a feature spec like this one. After roughly thirty minutes of back-and-forth, all seven files were in place.
The compile result was clean. Three issues surfaced at runtime. First, the writer service used synchronous file IO inside an async method. Not technically wrong, but it defeats the point. Second, the formatter was missing handling for code-block fences that the rest of the application uses for syntax preservation. It would have produced broken Markdown for any page containing code. Third, and most concerning, the writer overwrote existing files in dist/markdown/ without first checking whether the file was newer than its source — defeating the incremental-export feature flagged as required in the existing exporter pattern.
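The first issue, blocking IO inside an async method, looks like this in Python terms. The standard library has no native async file API, so one common approach is to push the blocking call onto a worker thread with asyncio.to_thread (Python 3.9+). The function names are hypothetical, and the project's own language may well offer true async file calls instead.

```python
import asyncio
from pathlib import Path


async def write_file_async(target: Path, content: str) -> None:
    """Write a file without blocking the event loop."""
    # Calling target.write_text(...) directly here would stall every other task
    # until the disk write finished; to_thread runs it on a worker thread.
    await asyncio.to_thread(target.write_text, content, encoding="utf-8")


async def write_all(files: dict) -> None:
    """Write several files concurrently; keys are paths, values are contents."""
    await asyncio.gather(*(write_file_async(path, text) for path, text in files.items()))
```

The synchronous version produces the same files, which is why the bug is "not technically wrong". The cost only shows up as stalls when other async work is waiting on the same loop.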
The third issue would have been a production problem. A user re-exporting after small content changes could end up overwriting hand-edits without warning. The fact that Cody didn't pick up the incremental-export check in the existing modules despite indexing them is the kind of failure that gives you pause about trusting it for architecturally important changes. Three bugs, one of them dangerous. About 90 minutes of cleanup.
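A freshness check of the kind described can be sketched with file modification times. This is an assumption about how such a check might work, not the existing exporters' actual implementation; should_write and write_if_stale are hypothetical names.

```python
from pathlib import Path


def should_write(source: Path, target: Path) -> bool:
    """Write only if the target is missing or older than its source."""
    if not target.exists():
        return True
    # A target newer than its source may contain hand-edits, so leave it alone.
    return source.stat().st_mtime > target.stat().st_mtime


def write_if_stale(source: Path, target: Path, content: str) -> bool:
    """Return True if the file was written, False if the export was skipped."""
    if not should_write(source, target):
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return True
```

Modification times are a blunt signal (a checkout or a copy can reset them), so a real implementation might compare content hashes instead. The point is that the overwrite has to be conditional on something.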
The scoreboard
Pulling the four tools into prose instead of a grid: Cursor on Claude Opus 4.7 produced no compile bugs and two runtime bugs, with roughly 60 minutes of cleanup after the agent finished. Windsurf on its Cascade agent (likely Claude Sonnet 4.7 underneath) was the closest competitor: zero compile bugs, one runtime bug, and one dependency divergence, with about 40 minutes of cleanup. The faster cleanup on Windsurf reflects that its one bug was a trivial null check and the dependency choice took a single decision to resolve.
GitHub Copilot's agent mode produced three compile bugs (mostly missing imports) and one runtime bug, with about 120 minutes of total work, most of it spent on a fabricated method call: the code referenced a method that didn't exist. Sourcegraph Cody produced no compile bugs and three runtime bugs, one of which would have shipped a footgun: a writer service that overwrote existing files without first checking if the destination was newer. About 90 minutes of cleanup, with the overwrite-check fix being most of the work.
The interesting result is Windsurf. It produced fewer bugs than Cursor and had a faster cleanup time, though both were excellent. The difference between the two is texture. Cursor's agent is more transparent about its plan, asks for confirmation at the right moments, and surfaces what it's doing in the editor in a way that helps you catch problems early. Windsurf is faster but less legible. Either choice works.
The Cody result was a disappointment. Going in, the expectation was that Cody would win on architectural awareness, because its codebase indexing is the best in the field. Instead, Cody produced more bugs than the other two Claude-backed tools, including one that would have shipped a footgun. The indexing helped Cody understand the context. It didn't help Cody write code that respected the patterns it had read.
The Copilot result is roughly what you'd expect from a tool oriented around line-by-line completions that has been retrofitted into agent mode. The friction of the back-and-forth interaction added more total time than the bug count alone suggests. Copilot is still excellent at autocomplete and competent at multi-file features, but the gap between Copilot and Cursor on agent work is wider than the marketing implies.
One genuine uncertainty: how much of the Copilot result is the agent mode vs the underlying model selection. GitHub doesn't publish which model backs each request — it's noted as "GPT-5 / mixed" because the routing seems to pick different models for different parts of the task. Whether a deterministic Opus-only routing would close the gap is a question I can't answer from outside the system.
What's not in the scoreboard
Three things this comparison didn't measure but you should weigh.
First: cost. Cursor charges $20/month for the Pro tier, per Cursor's pricing page, which includes a generous Claude Opus quota and unlimited Sonnet calls. Windsurf — the product page lives at windsurf.com — is comparable. Copilot for Individual is $10/month per GitHub's Copilot features page, with agent mode gated behind higher tiers. Cody's pricing has shifted twice in six months and isn't worth quoting in print; current rates live on Sourcegraph's Cody page. For a solo developer, all four are affordable. For a team of fifty, the math changes a lot. See price per use case for the broader cost picture.
Second: editor preference. Fifteen years of muscle memory in a particular editor is a real cost to switch away from. Copilot lets you stay where you are. That's worth real money to some readers.
Third: the open question of where these tools will be in twelve months. The product pace is fast. Any verdict written now has a six-month half-life at best.
Cursor: $20/mo, Claude Opus 4.7 backend. Verdict: Pick. Best plan-then-code flow.
Windsurf: $20/mo, Cascade agent, Sonnet-class. Verdict: Close 2nd. Fast, less hand-holding.
Copilot: $10/mo, VS Code, GPT-5 mix. Verdict: Skip agent. Autocomplete still best-in-class.
Cody: Variable pricing, Opus 4.7, deep indexing. Verdict: Wait. Worth re-testing in Q2.
Test workflow for each tool:
- Feature spec + reference module names.
- Agent reads codebase, writes 7 files.
- Compile + run + first manual exercise.
- Fix bugs, align patterns, polish.
The pick
For solo developers and small teams shipping real features in real codebases, my pick for this workload is Cursor Pro with Claude Opus 4.7 as the backend. The win isn't on benchmarks. It's on the texture of what the tool does, the moments when it stops to ask, the way the agent's plan is visible in the editor before code starts changing. Those are the properties that matter most when the code in question is going to be maintained for years.
Windsurf is a solid second pick. If Cursor's editor feels too opinionated, or its chat interface is the wrong fit, Windsurf will produce comparable code with slightly less ceremony. The Cascade agent is good.
GitHub Copilot is the right answer for autocomplete in Visual Studio and Visual Studio Code. Its agent mode is improving but is still a step behind. Cody had the best codebase awareness and the worst bug profile, which is a combo that's hard to recommend. Worth watching. Worth re-testing in six months. For background on where the agent layer is going more broadly, see AI agents, eighteen months in.
The most important point in this whole comparison isn't on the scoreboard: every one of these tools produced code that needed human review before shipping. None of them is a replacement for a competent maintainer. The differences are in how easy each tool makes the review and how much rework the review demands. Cursor demands less rework than the alternatives, in my testing, which is why it earns the pick. But the rework isn't zero, and any team treating these tools as autonomous engineers is going to ship the bugs they didn't catch.