Coding With Agents: What’s Different, What’s at Stake
Kolen Cheung
May 6th, 2026
“An AI agent is an LLM wrecking its environment in a loop.” — Solomon Hykes (via Willison 2025a)
“I have never felt more behind as a programmer.” — Karpathy, Sequoia Ascent 2026 (Karpathy 2026)
Models crossing the threshold: Claude Opus 4.5, Codex 5, Gemini 3
Tools: Claude Code (Feb 2025), Codex CLI (Apr 2025), Gemini CLI (Jun 2025), OpenCode
DHH, summer 2025 (Lex Fridman podcast):
AI coding tools made “competence drain out of my fingers.” Programming is like playing guitar — you don’t let someone else play for you.
Spring 2026 (Hansson 2026):
“It’s more like working on a team… I just review the final outcome, offer guidance when asked, and marvel at how this is possible at all.”
What changed: not his philosophy, but the tools.
“If you’re going to exploit these new tools, you need to be operating at the top of your game.” (Willison 2025c)
AI rewards existing engineering practices:
“AI tools amplify existing expertise.”
The readings cite tests, documentation, and version control as what makes agents fly. Reproducibility is conspicuously absent — yet it may matter as much:
Encourages separating side effects and state from core logic: reproducible code is easier for agents to reason about, for the same reason functional programming has an edge in producing correct programs. An agent working on a pure, stateless module is likely to make fewer silent mistakes than one entangled in global state.
Enables agent bootstrapping — a fully reproducible project can tell an agent: clone, run the install script, run the tests, fix failures. E.g. curl -fsSL https://pixi.sh/install.sh | sh && pixi run test. The agent can spin up its own environment in a sandbox or cloud instance with no human scaffolding. Good CI practices are the same idea, already automated.
Improves verifiability — Karpathy’s thesis is that LLMs advance fastest where outputs can be verified. Reproducibility makes outputs verifiable: the same input must produce the same output, which is a directly checkable invariant.
Enables safe sandboxing — a self-contained, reproducible project can be handed to an agent running in a throwaway container or remote VM. This directly mitigates the exfiltration risk from the lethal trifecta (Willison 2025b): private data stays out of scope by construction.
Bisectability — deterministic builds make git bisect reliable. An agent can automatically bisect a regression, confident that differences in output reflect code changes rather than environment drift.
Idempotency — reproducible workflows tend to be idempotent. Agents can safely retry failed steps, re-run the pipeline, or roll back without worrying about accumulated side effects corrupting state.
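Several of these properties can be made concrete in a few lines. The following is a minimal Python sketch, not taken from any of the readings; `normalise`, `output_digest`, and `write_once` are hypothetical names chosen for illustration. It shows a pure function, a digest that lets two runs be compared exactly, and an idempotent step that is safe to retry.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def normalise(values: list[float]) -> list[float]:
    """Pure function: no global state, no I/O, same input gives same output."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalise values that sum to zero")
    return [v / total for v in values]


def output_digest(result: list[float]) -> str:
    """Fingerprint a result so that two runs can be compared exactly."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def write_once(path: Path, text: str) -> bool:
    """Idempotent step: True if the file was written, False if already current."""
    if path.exists() and path.read_text(encoding="utf-8") == text:
        return False
    path.write_text(text, encoding="utf-8")
    return True


if __name__ == "__main__":
    data = [1.0, 3.0, 4.0]
    # Determinism check: two independent runs must agree exactly.
    assert output_digest(normalise(data)) == output_digest(normalise(data))

    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "result.json"
        text = json.dumps(normalise(data))
        print(write_once(target, text))  # first run writes the file
        print(write_once(target, text))  # a retry changes nothing
```

An agent (or a CI job) can run a check like this after every change: if the digest moves when the code has not, the environment has drifted, not the logic.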
An effective loop needs (Willison 2025a):
Works best for: debugging, performance optimisation, dependency upgrades, refactoring — anywhere with trial-and-error and verifiable outcomes.
Cf. Karpathy’s verifiability thesis (Karpathy 2026): AI advances fastest where outputs can be verified (tests pass/fail, code compiles/crashes, benchmarks improve).
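The shape of such a loop can be sketched in Python. This is an illustrative assumption, not code from Willison or Karpathy: `run_loop` and `passes_tests` are hypothetical names, and the list of candidate functions stands in for successive attempts an agent might produce. The point is only that the loop terminates on a verifiable pass/fail signal, with a cap on attempts.

```python
from itertools import islice
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def run_loop(
    candidates: Iterable[T],
    verify: Callable[[T], bool],
    max_attempts: int = 10,
) -> Optional[T]:
    """Return the first candidate that passes verification, or None."""
    for candidate in islice(candidates, max_attempts):
        if verify(candidate):
            return candidate
    return None


def passes_tests(fn: Callable[[int], int]) -> bool:
    """Verifier: a tiny test suite for an absolute-value function."""
    cases = [(-3, 3), (0, 0), (5, 5)]
    try:
        return all(fn(x) == expected for x, expected in cases)
    except Exception:
        # A crashing candidate counts as a failed attempt.
        return False


# Stand-ins for successive attempts by an agent (hypothetical).
attempts = [
    lambda x: x,                    # wrong for negative inputs
    lambda x: -x,                   # wrong for positive inputs
    lambda x: x if x >= 0 else -x,  # passes every case
]

winner = run_loop(attempts, passes_tests)
assert winner is not None and winner(-7) == 7
```

The verifier is the part the human has to get right: the loop is only as trustworthy as `passes_tests`.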
The correct moment to introduce an agent is after you understand the problem well enough to judge the answer.
The consensus across readings:
“AI tools amplify existing expertise.” — Willison (Willison 2025c)
“Vibe coding raises the floor. Agentic engineering raises the ceiling.” — Karpathy (Karpathy 2026)
Evidence:
But is this a comforting just-so story that flatters senior engineers?
DHH’s guitar analogy (Lex Fridman podcast, summer 2025):
The pleasure of programming is in the playing, not just the output.
DHH reversed himself in six months (Orosz 2026). But the feeling didn’t vanish for everyone.
Orosz survey: some Builders report grief at no longer coding by hand (Orosz and Nilsson 2026).
Karpathy (Karpathy 2026):
“You can outsource your thinking, but you can’t outsource your understanding.”
Open questions:
Ira Glass’s observation: individuals acquire taste much faster than talent.
Coding agents narrow this gap:
The flip side — if you lack taste or domain knowledge:
You cannot judge outputs you cannot recognise as wrong.
A deeper corollary: as AI capability surpasses human ability in a domain, the human stops noticing further advances — because evaluation requires being close enough to the ceiling to see it.
Orosz & Nilsson survey (900+ engineers) (Orosz and Nilsson 2026):
Three archetypes:
To count as responsible, Willison’s vibe engineering requires rigorous tests and specs. What does responsible agentic coding look like for research software, where the test harness often doesn’t exist and “correct” is contested?
DHH reversed his position in ~6 months. Has anyone here had a similar shift? What triggered it?
Willison says AI amplifies existing expertise. Does that match your experience — is there a skill AI makes less valuable?
“You can outsource your thinking, but you can’t outsource your understanding.” Where’s the line for research software?
If an RSE uses an agent to write code for a research project, who is responsible for the correctness of that code? Does the answer change if it produces a published result?
Individual RSEs may have strong AI preferences — from agent-first to AI-free — and project PIs have their own requirements driven by personal preference or funding constraints. Should RSE–project matching take these into account, so that an RSE who only works with agents is not assigned to a project that prohibits AI use, and vice versa?
The unit of programming changed from typing lines of code to delegating larger “macro actions”… This is why I think the profession is being refactored. The programmer is increasingly not just a code writer, but an orchestrator of agents.
My core automation framework is:
- Traditional software automates what you can specify.
- LLMs and reinforcement learning automate what you can verify.
capability spike ≈ verifiability × training attention × data coverage × economic value
I distinguish two related but different ideas:
- Vibe coding raises the floor. It lets almost anyone create software by describing what they want.
- Agentic engineering raises the ceiling. It is the professional discipline of coordinating fallible agents while preserving correctness, security, taste, and maintainability.
The old “10x engineer” idea may become much more extreme. People who master agentic workflows may outperform others by far more than 10x.
This means products need agent-native surfaces… I think about this in terms of sensors and actuators. A sensor turns some state of the world into digital information. An actuator lets an agent change something. The future stack is agents using sensors and actuators on behalf of people and organizations.
The right posture is neither dismissal nor blind trust. It is empirical familiarity: learn where they work, where they fail, what they were trained for, and how to build guardrails around them.
You can outsource your thinking, but you can’t outsource your understanding.
The scarce thing is shifting:
- Less scarce: code generation, API recall, boilerplate, first drafts, repetitive setup, simple transformations.
- More scarce: understanding, taste, eval design, security, system boundaries, agent orchestration, domain-specific feedback loops, and knowing when the model is off the rails.
My current worldview is not that AI simply makes everyone faster at the old work. It is that the work itself is being reorganized around agents. Software, research, education, infrastructure, and knowledge work are all becoming variations of the same pattern:
- define the context
- define the tools
- define the feedback loop
- define the guardrails
- let agents work
- preserve human understanding
I feel like vibe coding is pretty well established now as covering the fast, loose and irresponsible way of building software with AI—entirely prompt-driven, and with no attention paid to how the code actually works.
If you’re going to really exploit the capabilities of these new tools, you need to be operating at the top of your game. You’re not just responsible for writing the code—you’re researching approaches, deciding on high-level architecture, writing specifications, defining success criteria, designing agentic loops, planning QA, managing a growing army of weird digital interns who will absolutely cheat if you give them a chance, and spending so much time on code review.
Almost all of these are characteristics of senior software engineers already!
A big win from using AI agents is tackling stuff that you wouldn’t have before. A senior engineer at 37signals ran a “P1 optimization” project to improve the fastest 1% of requests.
Running several AI agents feels less like “project management” and more like “wearing a mech suit.”
37signals has one designer for every two engineers.
AI agents could turn 37signals’ “designer model” into the industry standard.
Command Line Interfaces (CLI) feel like the ultimate AI interface, which validates the Unix philosophy of the 1970s.
Eight hours of sleep is non-negotiable – even during an AI gold rush!
The thing to look out for here are problems with clear success criteria where finding a good solution is likely to involve (potentially slightly tedious) trial and error.
The lethal trifecta of capabilities is:
- Access to your private data—one of the most common purposes of tools in the first place!
- Exposure to untrusted content—any mechanism by which text (or images) controlled by a malicious attacker could become available to your LLM
- The ability to externally communicate in a way that could be used to steal your data (I often call this “exfiltration” but I’m not confident that term is widely understood.)
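As a rough illustration of how the three conditions combine (this is not from Willison's post), the check below flags an agent configuration only when all three capabilities are present at once. The `Tool` class, its flags, and the tool names are hypothetical; in practice the flags would have to come from an honest audit of what each tool can actually do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A tool exposed to an agent, with self-declared capability flags."""
    name: str
    reads_private_data: bool = False
    sees_untrusted_content: bool = False
    can_communicate_externally: bool = False


def has_lethal_trifecta(tools: list[Tool]) -> bool:
    """True only if the combined tool set covers all three capabilities."""
    return (
        any(t.reads_private_data for t in tools)
        and any(t.sees_untrusted_content for t in tools)
        and any(t.can_communicate_externally for t in tools)
    )


# Hypothetical configurations for comparison.
assistant = [
    Tool("read_inbox", reads_private_data=True, sees_untrusted_content=True),
    Tool("http_post", can_communicate_externally=True),
]
sandbox = [
    Tool("run_tests", sees_untrusted_content=True),
]

assert has_lethal_trifecta(assistant)
assert not has_lethal_trifecta(sandbox)
```

Removing any one leg, for example by keeping private data out of a throwaway sandbox, is enough to make the check return False.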