How to Get Ahead of 99% of Software Engineers with AI Agents
Newsletter 149 by Neo Kim (System Design One): “Change your software workflow with AI agents”
What an AI Agent Is
An agent takes a goal, picks its own next step, uses tools, and keeps going until the job is done or it gets stuck. When you paste a bug into Claude Code, it works out the rest on its own:
Read files → Grep for functions → Run tests → Read errors → Try again
Spectrum of AI tools:
- AI Autocomplete (early GitHub Copilot) — finishes the line you’re typing
- Chatbot (ChatGPT) — you ask, it answers, you copy
- Agent (Claude Code, Cursor) — reads files, runs tests, edits code until done
The agent must track what it’s tried and learned. That running context lets it pick sensible next steps.
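A minimal sketch of that loop, assuming a scripted stand-in for the model. The names `AgentState`, `run_agent`, `scripted_policy`, and `make_toy_tools` are hypothetical and are not taken from Claude Code or Cursor. The point is only the shape: pick a step, use a tool, record the observation, repeat until done or out of budget.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    goal: str
    # Running context: every (action, observation) pair tried so far.
    history: list = field(default_factory=list)
    finished: bool = False


def run_agent(goal, choose_action, tools, max_steps=10):
    """Loop until the policy says "done" or the step budget runs out."""
    state = AgentState(goal)
    for _ in range(max_steps):
        action = choose_action(state)      # the agent picks its own next step
        if action == "done":
            state.finished = True
            break
        observation = tools[action]()      # use a tool, keep what came back
        state.history.append((action, observation))
    return state                           # finished=False means it got stuck


def scripted_policy(state):
    """Stand-in for the model: decides from what has been tried so far."""
    if not state.history:
        return "read_files"
    last_action, last_observation = state.history[-1]
    if last_action == "read_files":
        return "run_tests"
    if last_action == "run_tests":
        return "done" if last_observation == "pass" else "edit_code"
    return "run_tests"                     # after an edit, check again


def make_toy_tools():
    """Fake tools: tests fail until edit_code has been called once."""
    world = {"fixed": False}

    def read_files():
        return "read 3 files"

    def run_tests():
        return "pass" if world["fixed"] else "fail"

    def edit_code():
        world["fixed"] = True
        return "patched"

    return {"read_files": read_files, "run_tests": run_tests, "edit_code": edit_code}


if __name__ == "__main__":
    result = run_agent("fix the failing test", scripted_policy, make_toy_tools())
    for action, observation in result.history:
        print(action, "->", observation)
```

The policy only ever looks at `state.history`, which is the running context the paragraph above describes. A real agent replaces `scripted_policy` with a model call and the toy tools with file reads, grep, and a test runner.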
The 8 Stages of Software Development with Agents
1. Plan — Agents help, but don’t decide
Good for: Reading the codebase, mapping what changes will touch, writing tickets
Bad for: Decision-making (weighing customer urgency, revenue, and risk)
Boris Cherny (creator of Claude Code): “Someone has to prompt the Claudes, talk to customers, coordinate with other teams, and decide what to build next.”
Verdict: Useful for write-ups, but humans must choose what to build.
2. Design — Agents are textbook, not contextual
Good for: Comparing patterns (microservices vs monolith), laying out tradeoffs
Bad for: Knowing your team’s real constraints. Textbook answers miss undocumented context.
Two failure modes:
- Duplication: GitClear found duplicated code blocks increased 8x in 2024 after AI tool adoption
- Over-engineering: Karpathy says models “overcomplicate, bloat abstractions, leave dead code”
Verdict: Stay with humans. Bad design is slow to surface and expensive to undo.
3. Code — Where agents shine (with setup)
Metrics: Cursor went from $100M to $1B ARR in ~1 year. 19% of devs use Cursor (Stack Overflow 2025).
Critical setup:
- Rules files (CLAUDE.md / AGENTS.md) — standing orders so agents don’t start from zero every session
- Spec-driven development — write a short spec first, agent turns it into implementation plan, execute one task at a time
Current limitation: Context window fills up. Quality drops well before the window is full. Spec-and-plan docs on disk free up the agent’s window.
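One way to keep the plan on disk instead of in the context window, sketched under assumptions: the checkbox format (`- [ ]` / `- [x]`), the file name, and the helpers `next_task`, `mark_done`, and `run_one_task` are all hypothetical, not a format that Claude Code or any other tool prescribes. Each session reads the plan, takes exactly one open task, and writes progress back.

```python
from pathlib import Path

TODO, DONE = "- [ ] ", "- [x] "


def next_task(plan_text):
    """Return the first unchecked task, or None when the plan is complete."""
    for line in plan_text.splitlines():
        if line.startswith(TODO):
            return line[len(TODO):]
    return None


def mark_done(plan_text, task):
    """Tick off one task; only the first matching line changes."""
    return plan_text.replace(TODO + task, DONE + task, 1)


def run_one_task(plan_path, execute):
    """Read the plan from disk, run exactly one task, write progress back."""
    plan_path = Path(plan_path)
    plan_text = plan_path.read_text()
    task = next_task(plan_text)
    if task is not None:
        execute(task)  # hypothetical hook: hand just this task to the agent
        plan_path.write_text(mark_done(plan_text, task))
    return task
```

Because progress lives in the file, a fresh session with an empty window can pick up where the last one stopped by calling `run_one_task` again.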
4. Test — Agents are built for this
Metrics: Momentic ran 200M+ test steps in a month and caught 390K+ bugs. Diffblue Cover reports writing unit tests 250x faster than a developer writing them by hand.
Three failure modes:
- Empty tests — pass without checking anything real
- Test overload — Meta had to switch to PR-scoped tests deleted after each run
- Visual testing — agents are slow/expensive testing anything on screen (screenshot → locate → click → re-screenshot)
5. Review — Speed and coverage, but noisy
Metrics: CodeRabbit runs on 6M+ repos for 15K+ teams. 278,790 real review comments studied — devs acted on 16.6% of AI suggestions vs 56.5% from humans.
Danger: Signal buried in noise. When most comments aren’t actionable, you stop reading closely.
Verdict: Use the agent as a first pass that never misses a file, not as the final reviewer.
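A quick worked calculation on the two action rates above (16.6% and 56.5%) shows what the noise costs a reader. The function name `comments_per_acted_on` is made up for illustration, and the calculation assumes each comment is equally likely to be acted on, which is a simplification.

```python
def comments_per_acted_on(action_rate):
    """Average number of comments read for each one that leads to a change."""
    return 1 / action_rate


ai = comments_per_acted_on(0.166)     # AI suggestions, rate from the study above
human = comments_per_acted_on(0.565)  # human comments, rate from the study above

print(f"AI:    {ai:.1f} comments read per acted-on suggestion")     # 6.0
print(f"Human: {human:.1f} comments read per acted-on suggestion")  # 1.8
print(f"Ratio: {ai / human:.1f}x")                                  # 3.4x
```

Under that assumption, a developer reads about six AI comments to find one worth acting on, versus fewer than two from a human reviewer.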
6. Deploy — Clear NO
- You can’t cleanly undo a bad deploy (users already hit it)
- A bad deploy can take down the entire system at once
- Rollbacks are risky operations under pressure
Verdict: Let the agent prepare the release and run checks. Keep a human on the final push.
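That split can be enforced in code rather than by convention. A minimal sketch, assuming a hypothetical `ReleaseCandidate` record that the agent fills in and a `push` callable that performs the actual deploy. None of these names come from a real deployment tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReleaseCandidate:
    version: str
    checks: dict                        # check name -> passed? (filled in by the agent)
    approved_by: Optional[str] = None   # set by a person, never by the agent


def failed_checks(rc):
    """Names of checks that did not pass, sorted for stable messages."""
    return sorted(name for name, ok in rc.checks.items() if not ok)


def deploy(rc, push):
    """Refuse to push unless every check passed and a human signed off."""
    failed = failed_checks(rc)
    if failed:
        raise RuntimeError("checks failed: " + ", ".join(failed))
    if rc.approved_by is None:
        raise PermissionError("a human must approve the final push")
    push(rc.version)
    return f"{rc.version} deployed, approved by {rc.approved_by}"
```

The agent can build the candidate and run every check, but `deploy` raises until `approved_by` is set, so the final push stays a human action.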
7. Operate — Gather facts, don’t diagnose
Metrics: Faros AI (22K developers) — production incidents per merged change increased 242% as AI use climbed. DORA 2024 & 2025: greater AI use correlates with lower stability.
Irony: Agents are good at analyzing incidents caused by bad agent code. Datadog’s Bits AI SRE investigates alerts before you’ve opened your laptop.
Danger: An agent states its diagnosis with the same confidence whether it is right or wrong, so you commit to the wrong fix sooner.
Verdict: Let the agent gather facts. Keep the diagnosis yours.
8. Maintain — Clearest yes
Metrics: Dependabot used across millions of repos. Automated security updates fix critical vulnerabilities significantly faster.
Why it works: Maintenance work grades itself — a dependency bump either passes tests or it doesn’t. The test suite is the judge.
Subtle danger: Each small change passes, but over time they stack into a harder-to-read codebase. No single commit is the culprit (same blindness as Design stage).
Verdict: Let the agent handle maintenance. Keep a human to read what it actually changed.
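A sketch of "the test suite is the judge", assuming pinned versions live in a plain dict and `run_tests` is any callable that returns True when the suite passes. The function names and the package names in the usage note are hypothetical, and this is not how Dependabot is implemented.

```python
def try_bump(pinned, package, new_version, run_tests):
    """Apply one bump to a copy; keep it only if the test suite passes."""
    candidate = {**pinned, package: new_version}
    if run_tests(candidate):
        change = f"{package}: {pinned[package]} -> {new_version}"
        return candidate, change   # the change note is for the human reviewer
    return pinned, None            # suite failed: discard the bump


def bump_all(pinned, available, run_tests):
    """Try each available update one at a time and collect what was kept."""
    changes = []
    for package, new_version in available.items():
        pinned, change = try_bump(pinned, package, new_version, run_tests)
        if change:
            changes.append(change)
    return pinned, changes
```

For example, with `pinned = {"liba": "1.0.0", "libb": "2.1.0"}` and a suite that fails on `libb` 3.0.0, `bump_all` keeps the `liba` update, drops the `libb` one, and returns a list of changes. That list is what the human in the verdict above should read, since a passing suite says nothing about readability.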
The Pattern: Who Judges?
Agents do well wherever a machine can tell them they’re wrong (failing test, red build, type error). Agents struggle wherever the only judge is a person (right feature? right architecture? right moment to ship?).
Why More Agents ≠ Faster Teams
Nicholas Carlini (Anthropic) ran 16 Claude agents in parallel to build a C compiler across nearly 2,000 sessions (about $20K in compute):
“Every agent would hit the same bug, fix that bug, and then overwrite each other’s changes.”
Agents duplicate and overwrite each other. Shared memory across the SDLC is the unsolved problem.
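One common way to stop parallel workers from picking up the same job is a claim file created atomically. The sketch below is an assumption for illustration, not a description of the setup in the compiler experiment, and it does nothing for the shared-memory problem: it only stops two agents from starting the same task. `claim_task` and the lock directory layout are hypothetical.

```python
from pathlib import Path


def claim_task(lock_dir, task_id, agent_id):
    """Return True if this agent won the claim, False if another got there first."""
    path = Path(lock_dir) / f"{task_id}.lock"
    try:
        # Mode "x" is exclusive creation: open() fails if the file already exists.
        with open(path, "x") as handle:
            handle.write(agent_id)
        return True
    except FileExistsError:
        return False


def owner_of(lock_dir, task_id):
    """Read back which agent holds the claim, or None if unclaimed."""
    path = Path(lock_dir) / f"{task_id}.lock"
    return path.read_text() if path.exists() else None
```

An agent that gets False moves on to a different task instead of fixing the same bug a second time.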
What to Expect When an Agent is on Your Team
METR study (2025): 16 experienced devs, 246 real tasks on their own repos:
- Took 19% longer on tasks where AI tools were allowed
- Believed AI had made them 20% faster
- The gap: your sense of whether an agent is helping is not reliable
Your week changes: More reading code, less writing code. Different cognitive cost.
Safety controls: Sandbox, scoped permissions, audit log.
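Two of those controls, scoped permissions and an audit log, fit in a few lines. This is a sketch under assumptions: `ScopedTools` is a hypothetical wrapper, it only checks requests and does not execute anything, and it is not a sandbox (real isolation needs a container or VM).

```python
from pathlib import Path


class ScopedTools:
    """Checks agent requests against a project root and a command allowlist."""

    def __init__(self, root, allowed_commands):
        self.root = Path(root).resolve()
        self.allowed = set(allowed_commands)
        self.audit_log = []  # every request is recorded, allowed or not

    def check_path(self, path):
        """Allow only paths that resolve to somewhere inside the project root."""
        resolved = (self.root / path).resolve()
        ok = resolved == self.root or self.root in resolved.parents
        self.audit_log.append(("path", str(path), ok))
        return ok

    def check_command(self, argv):
        """Allow only commands whose executable is on the allowlist."""
        ok = bool(argv) and argv[0] in self.allowed
        self.audit_log.append(("command", " ".join(argv), ok))
        return ok
```

With a root of the repo and an allowlist of `["pytest", "git"]`, a request for `src/app.py` passes, `../secrets.txt` is refused, and `rm -rf /` is refused. All three land in `audit_log` either way, so a human can later see what the agent attempted as well as what it did.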
Where to start: Coding, review, maintenance (wrong answer is cheap to undo). NOT design or deploy.
References
- Boris Cherny on why Anthropic keeps hiring: x.com/bcherny
- Duplicated code +8x: GitClear, AI Copilot Code Quality 2025
- Karpathy on bloat: x.com/karpathy
- Cursor revenue: Cursor Series D
- Stack Overflow 2025: survey
- Context quality drops: Chroma, Context Rot
- Claude Code memory docs: Anthropic
- Superpowers repo: github.com/obra/superpowers
- Momentic: Series A
- Diffblue Cover: Cover
- Meta’s just-in-time tests: The death of traditional testing
- Human-AI review synergy (278K comments): study
- DORA 2024: report
- DORA 2025: report
- Faros AI: The AI Acceleration Whiplash
- Datadog Bits AI SRE: launch
- Nicholas Carlini C compiler: Building a C compiler with Claude
- METR study: paper
- Replit agent deleted production DB: Fortune
- CodeRabbit Slack agent: launch