How to Get Ahead of 99% of Software Engineers with AI Agents

Newsletter 149 by Neo Kim (System Design One): “Change your software workflow with AI agents”

What an AI Agent Is

An agent takes a goal, picks its own next step, uses tools, and keeps going until the job is done or it gets stuck. When you paste a bug into Claude Code, it works out the rest on its own:

Read files → Grep for functions → Run tests → Read errors → Try again

Spectrum of AI tools:

AI Autocomplete (early GitHub Copilot) — finishes the line you’re typing
Chatbot (ChatGPT) — you ask, it answers, you copy
Agent (Claude Code, Cursor) — reads files, runs tests, edits code until done

The agent must track what it’s tried and learned. That running context lets it pick sensible next steps.
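The loop described above can be sketched in a few lines. This is a minimal illustration, not how Claude Code or Cursor is implemented: `AgentState`, `run_agent`, `make_toy_tools`, and `scripted_policy` are hypothetical names, the tools are stubs, and a scripted policy stands in for the model. The point is that the history of (action, observation) pairs is what lets the next step be chosen sensibly.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    goal: str
    # Running context: every (action, observation) pair seen so far.
    history: list = field(default_factory=list)


def run_agent(goal, tools, choose_next, max_steps=10):
    """Loop until the policy reports done (returns None) or steps run out."""
    state = AgentState(goal)
    for _ in range(max_steps):
        action = choose_next(state)
        if action is None:
            return state, "done"
        observation = tools[action]()
        state.history.append((action, observation))
    return state, "stuck"


def make_toy_tools():
    """Hypothetical stub tools; the test run fails once, then passes."""
    runs = {"count": 0}

    def run_tests():
        runs["count"] += 1
        return "pass" if runs["count"] >= 2 else "fail"

    return {
        "read_files": lambda: "read 3 files",
        "grep": lambda: "found parse_config",
        "run_tests": run_tests,
    }


def scripted_policy(state):
    """Stand-in for the model: uses history to decide what to do next."""
    if state.history and state.history[-1] == ("run_tests", "pass"):
        return None
    tried = {action for action, _ in state.history}
    for step in ("read_files", "grep", "run_tests"):
        if step not in tried:
            return step
    return "run_tests"  # tests failed earlier, so try again


if __name__ == "__main__":
    state, outcome = run_agent("fix the bug", make_toy_tools(), scripted_policy)
    print(outcome, [action for action, _ in state.history])
```

With the toy tools, the run ends as "done" after four actions. Capping `max_steps` at 3 ends it as "stuck", which is the "until it gets stuck" branch.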

The 8 Stages of Software Development with Agents

1. Plan — Agents help, but don’t decide

Good for: Reading the codebase, mapping what changes will touch, writing tickets
Bad for: Decision-making (weighing customer urgency, revenue, and risk)

Boris Cherny (creator of Claude Code): “Someone has to prompt the Claudes, talk to customers, coordinate with other teams, and decide what to build next.”

Verdict: Useful for write-ups, but humans must choose what to build.

2. Design — Agents are textbook, not contextual

Good for: Comparing patterns (microservices vs monolith), laying out tradeoffs
Bad for: Knowing your team’s real constraints (textbook answers miss undocumented context)

Two failure modes:

Duplication: GitClear found duplicated code blocks increased 8x in 2024 after AI tool adoption
Over-engineering: Karpathy says models “overcomplicate, bloat abstractions, leave dead code”

Verdict: Stay with humans. Bad design is slow to surface and expensive to undo.

3. Code — Where agents shine (with setup)

Metrics: Cursor went from $100M to $1B ARR in ~1 year. 19% of devs use Cursor (Stack Overflow 2025).

Critical setup:

Rules files (CLAUDE.md / AGENTS.md): standing orders so agents don’t start from zero every session
Spec-driven development: write a short spec first, the agent turns it into an implementation plan, then executes one task at a time

Current limitation: Context window fills up. Quality drops well before the window is full. Spec-and-plan docs on disk free up the agent’s window.
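One way to picture the plan-on-disk idea is below. It is a sketch under assumptions, not a format any tool prescribes: the file name `plan.json`, the task fields, and the helpers `save_plan`, `next_task`, `mark_done`, and `build_prompt` are all hypothetical. Each step reloads only the spec summary and the single pending task, so progress lives in the file rather than in the agent's window.

```python
import json
import tempfile
from pathlib import Path


def save_plan(path, tasks):
    """Write the implementation plan to disk as a list of pending tasks."""
    plan = [{"id": i, "task": t, "done": False} for i, t in enumerate(tasks, start=1)]
    path.write_text(json.dumps(plan, indent=2))


def next_task(path):
    """Return the first task not yet done, or None when the plan is finished."""
    for item in json.loads(path.read_text()):
        if not item["done"]:
            return item
    return None


def mark_done(path, task_id):
    """Persist progress so a fresh session can resume without prior context."""
    plan = json.loads(path.read_text())
    for item in plan:
        if item["id"] == task_id:
            item["done"] = True
    path.write_text(json.dumps(plan, indent=2))


def build_prompt(spec, task):
    """Only the spec summary and the current task go into the prompt."""
    return f"Spec: {spec}\nCurrent task: {task['task']}"


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        plan_path = Path(tmp) / "plan.json"
        save_plan(plan_path, ["add endpoint", "write tests", "update docs"])
        executed = []
        while (task := next_task(plan_path)) is not None:
            prompt = build_prompt("CSV export for reports", task)
            # A real run would hand `prompt` to the agent here.
            executed.append(task["task"])
            mark_done(plan_path, task["id"])
        return executed


if __name__ == "__main__":
    print(demo())
```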

4. Test — Agents are built for this

Metrics: Momentic ran 200M+ test steps in a month, caught 390K+ bugs. Diffblue Cover writes tests 250x faster.

Three failure modes:

Empty tests: pass without checking anything real
Test overload: Meta switched to PR-scoped tests that are deleted after each run
Visual testing: agents are slow and expensive at testing anything on screen (screenshot → locate → click → re-screenshot)
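The "empty tests" failure mode is easy to show with a toy example. The function `apply_discount` and its deliberately broken variant are hypothetical. The sketch runs the same two tests against both: a test with no assertion stays green on broken code, while a test that checks the result catches it.

```python
def apply_discount(price_cents, percent):
    """Correct implementation: integer cents avoid float rounding."""
    return price_cents * (100 - percent) // 100


def broken_discount(price_cents, percent):
    """Deliberately wrong: ignores the discount entirely."""
    return price_cents


def empty_test(fn):
    # Calls the code but asserts nothing, so it can only fail on an exception.
    fn(10_000, 20)


def real_test(fn):
    # Checks the actual result: 20% off 10,000 cents is 8,000 cents.
    assert fn(10_000, 20) == 8_000


def passes(test, fn):
    """Run one test against one implementation and report pass/fail."""
    try:
        test(fn)
        return True
    except AssertionError:
        return False


if __name__ == "__main__":
    for test in (empty_test, real_test):
        for fn in (apply_discount, broken_discount):
            print(test.__name__, fn.__name__, passes(test, fn))
```

Running tests against a known-bad implementation like this is a cheap check on agent-written tests: if a test cannot fail, it is not checking anything.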

5. Review — Speed and coverage, but noisy

Metrics: CodeRabbit runs on 6M+ repos for 15K+ teams. 278,790 real review comments studied — devs acted on 16.6% of AI suggestions vs 56.5% from humans.

Danger: Signal buried in noise. When most comments aren’t actionable, you stop reading closely.

Verdict: Use agent as first pass that never misses a file, NOT as final reviewer.

6. Deploy — Clear NO

You can’t cleanly undo a bad deploy (users already hit it)
A bad deploy can take down the entire system at once
Rollbacks are risky operations under pressure

Verdict: Let agent prepare the release and run checks. Keep a human on the final push.

7. Operate — Gather facts, don’t diagnose

Metrics: Faros AI (22K developers) — production incidents per merged change increased 242% as AI use climbed. DORA 2024 & 2025: greater AI use correlates with lower stability.

Irony: Agents are good at analyzing incidents caused by bad agent code. Datadog’s Bits AI SRE investigates alerts before you’ve opened your laptop.

Danger: Agent states its diagnosis with the same confidence whether right or wrong, so you commit to the wrong fix sooner.

Verdict: Let agent gather facts. Keep diagnosis yours.

8. Maintain — Clearest yes

Metrics: Dependabot used across millions of repos. Automated security updates fix critical vulnerabilities significantly faster.

Why it works: Maintenance work grades itself — a dependency bump either passes tests or it doesn’t. The test suite is the judge.
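The "test suite is the judge" idea can be sketched as a gate around a dependency bump. This is an illustration under assumptions, not how Dependabot works: `try_bump`, `command_passes`, the package name `libfoo`, and the toy judge are hypothetical. The bump is kept only if the judge passes and is reverted otherwise.

```python
import subprocess
import sys


def command_passes(cmd):
    """Judge by exit code, the way a CI job would: 0 means the suite passed."""
    return subprocess.run(cmd, capture_output=True).returncode == 0


def try_bump(deps, name, new_version, judge):
    """Apply a version bump, keep it if the judge passes, revert if not."""
    previous = deps[name]
    deps[name] = new_version
    if judge(deps):
        return True
    deps[name] = previous
    return False


def toy_judge(deps):
    """Stand-in for a real test run: pretend only 2.x releases are compatible."""
    return deps["libfoo"].startswith("2.")


if __name__ == "__main__":
    deps = {"libfoo": "2.3.0"}
    print(try_bump(deps, "libfoo", "2.4.1", toy_judge), deps)
    print(try_bump(deps, "libfoo", "3.0.0", toy_judge), deps)
    # In a real repo the judge would shell out to the test suite instead:
    print(command_passes([sys.executable, "-c", "raise SystemExit(0)"]))
```

The gate only answers "did the tests pass?". It says nothing about whether the accumulated changes stay readable, which is why a human still reads what was changed.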

Subtle danger: Each small change passes, but over time they stack into a harder-to-read codebase. No single commit is the culprit (same blindness as Design stage).

Verdict: Let agent handle maintenance. Keep a human to read what it actually changed.

The Pattern: Who Judges?

Agents do well wherever a machine can tell them they’re wrong (failing test, red build, type error). Agents struggle wherever the only judge is a person (right feature? right architecture? right moment to ship?).

Why More Agents ≠ Faster Teams

Nicholas Carlini (Anthropic) ran 16 Claude agents in parallel to build a C compiler over 2,000 sessions ($20K compute):

“Every agent would hit the same bug, fix that bug, and then overwrite each other’s changes.”

Agents duplicate and overwrite each other. Shared memory across the SDLC is the unsolved problem.
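A partial mitigation for duplicated work is to make agents claim a task before starting it. The sketch below assumes a shared directory that all agents can see. The names `claim` and `release` and the `.lock` file convention are hypothetical, and this is not a description of Carlini's setup. It relies on exclusive file creation, so only one agent can hold a given task. It prevents two agents from fixing the same bug at once, but it does not solve the shared-memory problem.

```python
import os
import tempfile
from pathlib import Path


def claim(task_dir, task_id, agent):
    """Try to claim a task; exclusive creation means exactly one agent wins."""
    lock = Path(task_dir) / f"{task_id}.lock"
    try:
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent already holds this task
    with os.fdopen(fd, "w") as handle:
        handle.write(agent)
    return True


def release(task_dir, task_id):
    """Drop the claim once the fix is merged so the task can be reused."""
    (Path(task_dir) / f"{task_id}.lock").unlink()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        first = claim(tmp, "fix-parser-bug", "agent-1")
        second = claim(tmp, "fix-parser-bug", "agent-2")
        owner = (Path(tmp) / "fix-parser-bug.lock").read_text()
        release(tmp, "fix-parser-bug")
        third = claim(tmp, "fix-parser-bug", "agent-2")
        return first, second, owner, third


if __name__ == "__main__":
    print(demo())
```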

What to Expect When an Agent is on Your Team

METR study (2025): 16 experienced devs, 246 real tasks on their own repos:

Took 19% longer on tasks where AI tools were allowed
Believed afterward that AI had made them 20% faster
The gap: your sense of whether an agent is helping is not reliable

Your week changes: More reading code, less writing code. Different cognitive cost.

Safety controls: Sandbox, scoped permissions, audit log.
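Two of these controls, scoped permissions and an audit log, can be sketched as a thin wrapper around whatever executes the agent's commands. `GuardedRunner` and its fields are hypothetical, and the sandbox itself is not implemented here: the executor is injected and is assumed to run inside one. An allowlist keyed on program name alone is a coarse filter, not a security boundary.

```python
import shlex
from datetime import datetime, timezone


class GuardedRunner:
    """Checks each command against an allowlist and records every attempt."""

    def __init__(self, allowed, execute):
        self.allowed = set(allowed)
        self.execute = execute  # injected; assumed to run inside a sandbox
        self.audit_log = []

    def run(self, command):
        parts = shlex.split(command)
        program = parts[0] if parts else ""
        permitted = program in self.allowed
        # Log before acting, so denied attempts are recorded too.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "command": command,
            "allowed": permitted,
        })
        if not permitted:
            raise PermissionError(f"command not permitted: {program!r}")
        return self.execute(command)


if __name__ == "__main__":
    runner = GuardedRunner({"ls", "pytest"}, execute=lambda cmd: f"ran {cmd}")
    print(runner.run("pytest -q"))
    try:
        runner.run("rm -rf build")
    except PermissionError as err:
        print(err)
    print([entry["allowed"] for entry in runner.audit_log])
```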

Where to start: Coding, review, maintenance (wrong answer is cheap to undo). NOT design or deploy.

References

Boris Cherny on why Anthropic keeps hiring: x.com/bcherny
Duplicated code +8x: GitClear, AI Copilot Code Quality 2025
Karpathy on bloat: x.com/karpathy
Cursor revenue: Cursor Series D
Stack Overflow 2025: survey
Context quality drops: Chroma, Context Rot
Claude Code memory docs: Anthropic
Superpowers repo: github.com/obra/superpowers
Momentic: Series A
Diffblue Cover: Cover
Meta’s just-in-time tests: The death of traditional testing
Human-AI review synergy (278K comments): study
DORA 2024: report
DORA 2025: report
Faros AI: The AI Acceleration Whiplash
Datadog Bits AI SRE: launch
Nicholas Carlini C compiler: Building a C compiler with Claude
METR study: paper
Replit agent deleted production DB: Fortune
CodeRabbit Slack agent: launch
