eternalsix

Posted on Jun 16 • Originally published at eternalsix.com

Why I stopped using ChatGPT for code reviews

#ai #productivity #saas #buildinpublic

Why I Stopped Using ChatGPT for Code Reviews (And What I Use Instead)

Last month I pasted 400 lines of a TypeScript service into ChatGPT and asked it to review the code for security vulnerabilities. It told me my code was "well-structured" and "followed best practices." It missed a raw SQL string concatenation that would have been a textbook SQL injection if we'd shipped it. That was the moment I started treating LLM code reviews as a process problem, not just a prompt problem.
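
To make the bug class concrete, here is a minimal sketch of the pattern. It is not the actual service code. The `Query` shape, the function names, the `users` table, and the PostgreSQL-style `$1` placeholder are all assumptions chosen for illustration.

```typescript
// Hypothetical query shape: SQL text plus separately bound values,
// similar in spirit to what many database drivers accept.
interface Query {
  text: string;
  values: unknown[];
}

// Vulnerable: the input is spliced directly into the SQL text,
// so a quote character in the input can change the query's structure.
function findUserUnsafe(email: string): Query {
  return {
    text: "SELECT * FROM users WHERE email = '" + email + "'",
    values: [],
  };
}

// Safer: the SQL text is fixed and the input travels as a bound parameter.
function findUserSafe(email: string): Query {
  return {
    text: "SELECT * FROM users WHERE email = $1",
    values: [email],
  };
}

const attack = "x' OR '1'='1";
console.log(findUserUnsafe(attack).text);
// SELECT * FROM users WHERE email = 'x' OR '1'='1'
console.log(findUserSafe(attack).text);
// SELECT * FROM users WHERE email = $1
```

In the unsafe version the attacker's quote closes the string literal and the `OR` clause becomes part of the query. In the safe version the same input is only ever data.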

The Flattery Problem Is Real and Underreported

ChatGPT is trained to be helpful. In RLHF terms, "helpful" often correlates with "positive and affirming." When you paste code and ask "is this good?", you are asking a model that is tuned to leave you feeling good about the interaction. It will find things to praise. It will soften criticism. It will hedge.

I've run this experiment enough times to be confident: if you paste buggy code into ChatGPT with a confident framing ("here's my optimized auth flow, any final thoughts?"), you will get a more favorable review than if you paste the same code and say "find everything wrong with this." The framing changes the output dramatically. That's not a feature. In a code review context, it's a liability.
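
If you want to repeat the framing experiment yourself, a small harness is enough. This is a rough sketch: `Reviewer` is a placeholder for whatever function wraps your model call, and the harness only collects the two reviews side by side. Judging which one is more favorable is still up to you.

```typescript
// A reviewer is any function that turns a prompt into review text.
// In practice it would wrap a model API call; here it is only a placeholder type.
type Reviewer = (prompt: string) => Promise<string>;

// The two framings quoted in the paragraph above.
const FRAMINGS = {
  confident: "Here's my optimized auth flow, any final thoughts?",
  adversarial: "Find everything wrong with this.",
} as const;

type Framing = keyof typeof FRAMINGS;

// Prepends the chosen framing to the code under review.
function buildPrompt(framing: Framing, code: string): string {
  return `${FRAMINGS[framing]}\n\n${code}`;
}

// Sends the same code through both framings and returns the raw reviews.
async function compareFramings(
  review: Reviewer,
  code: string
): Promise<Record<Framing, string>> {
  const [confident, adversarial] = await Promise.all([
    review(buildPrompt("confident", code)),
    review(buildPrompt("adversarial", code)),
  ]);
  return { confident, adversarial };
}
```

Keeping the code identical and changing only the first line of the prompt is the whole point: any difference in the reviews comes from the framing.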

The fix is not "just prompt it better." That puts the burden on the reviewer — which is you, the person who wrote the code and already has blind spots.

Context Windows Are a False Promise

Every few months a new model ships with a bigger context window and people get excited about pasting entire codebases. I get it. I've done it. The problem is that attention is not uniform across 200k tokens. Models degrade on long-context reasoning in ways that are subtle and hard to catch. They'll reference a function from line 12 when the actual logic changed at line 3,847. They'll miss that a variable you defined early was redefined later. They'll give you confident answers that reflect an earlier part of the context, not the current one.

Code review requires holding an entire mental model of the system simultaneously — not just the file you're looking at, but the contracts between services, the assumptions baked into the ORM, the edge cases that live in a config file three directories away. No context window solves that problem right now. Any tool that claims otherwise is marketing, not engineering.

The Single-Model Monoculture Risk

When your entire team does AI-assisted code review through the same model, you inherit that model's blind spots at scale. In my experience, GPT-4 tends to miss certain classes of async bugs. Claude is better at some of those but weaker on others. Neither is consistently good at catching subtle race conditions in distributed systems.

If everyone on your team uses ChatGPT for code review and ChatGPT systematically underweights a category of bug, those bugs ship. You don't discover this until you have an incident. This is the argument for model diversity in your review pipeline — not because any single model is bad, but because single-model monoculture removes variance, and variance is what catches edge cases.

The "It Didn't Ask Me Any Questions" Problem

A good senior engineer reviewing your code asks clarifying questions. "What's the expected scale here?" "Is this running in a transaction?" "Did you consider what happens if this queue is empty when this fires?" ChatGPT, by default, doesn't ask questions — it answers them. It fills in the blanks with assumptions and gives you a review based on those assumed constraints.

When I reviewed the SQL injection code myself after ChatGPT blessed it, the first question I asked was: "Is this value ever coming from user input?" It was. ChatGPT had no way to know that without asking, and it didn't ask. It assumed the value was trusted and reviewed accordingly.

The best code review tooling needs to surface what it doesn't know, not paper over it with confident-sounding prose.

A Framework for Evaluating AI Code Review Tools

Before committing to any AI tool in your review workflow, run it through these five checks:

1. The Adversarial Prompt Test
Paste intentionally broken code. Don't tell the model it's broken. See if it finds the issues without prompting. If it praises the code, disqualify it.

2. The Assumption Surfacing Test
Ask the tool to review a function with an ambiguous external dependency. Does it ask about the dependency, or does it assume and proceed? Tools that assume are dangerous.

3. The Cross-File Coherence Test
Give it two files where a contract between them is violated. Does it catch the violation? This tests whether the tool actually reasons across context or just pattern-matches within files.

4. The False Positive Rate Check
Good security review catches real issues. But a tool that flags everything as potentially vulnerable is noise, not signal. Track how often its findings are actionable vs. generic warnings.

5. The Model Diversity Test
If the tool runs on one model, ask the vendor what the blind spots are. If they say "none," stop using the tool. Every model has blind spots. Honest tooling acknowledges them.
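
The five checks can be tracked as a simple scorecard. The sketch below is one possible way to record results, with made-up names (`CheckName`, `Scorecard`, `actionableRate`, `failedChecks`). Whatever threshold you pick for the false positive check is your own call, not something the framework prescribes.

```typescript
type CheckName =
  | "adversarialPrompt"
  | "assumptionSurfacing"
  | "crossFileCoherence"
  | "falsePositiveRate"
  | "modelDiversity";

interface Finding {
  description: string;
  actionable: boolean; // judged by a human after triage
}

// Share of findings a human marked as actionable, in the range 0..1.
function actionableRate(findings: Finding[]): number {
  if (findings.length === 0) return 0;
  const actionable = findings.filter((f) => f.actionable).length;
  return actionable / findings.length;
}

// One pass/fail entry per check.
type Scorecard = Record<CheckName, boolean>;

// Returns the names of the checks the tool failed.
function failedChecks(card: Scorecard): CheckName[] {
  return (Object.keys(card) as CheckName[]).filter((name) => !card[name]);
}

const triaged: Finding[] = [
  { description: "Unparameterized query in user lookup", actionable: true },
  { description: "Consider validating input", actionable: false },
  { description: "Missing await on cache write", actionable: true },
  { description: "Token compared with ==", actionable: true },
];

const card: Scorecard = {
  adversarialPrompt: true,
  assumptionSurfacing: false,
  crossFileCoherence: true,
  falsePositiveRate: actionableRate(triaged) >= 0.5, // 0.5 is an arbitrary example threshold
  modelDiversity: true,
};

console.log(actionableRate(triaged)); // 0.75
console.log(failedChecks(card)); // [ 'assumptionSurfacing' ]
```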

How AI Handler Approaches This

Everything above is what I'm building against. AI Handler is a unified AI workflow tool designed for developers who are serious about getting actual signal from AI — not confidence theater.

For code review specifically, AI Handler routes your review across multiple models simultaneously, then synthesizes where they agree and flags where they diverge. Divergence is often the most important signal: when two models disagree about whether something is safe, that's exactly where a human needs to look. Consensus gives you confidence. Disagreement gives you a checklist.
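
To illustrate the agree/diverge idea in general terms, here is a minimal sketch. It is not AI Handler's implementation. It assumes each model's findings have already been normalized to a shared `issueKey` (a hypothetical identifier), which in practice is the hard part.

```typescript
interface ModelFinding {
  model: string; // hypothetical model label, e.g. "model-a"
  issueKey: string; // normalized issue identifier, e.g. "sql-injection:userRepo.ts"
}

interface Synthesis {
  consensus: string[]; // issues reported by every model
  divergent: string[]; // issues reported by only some models
}

// `models` lists every model that ran, including ones that reported nothing.
function synthesize(models: string[], findings: ModelFinding[]): Synthesis {
  const modelCount = new Set(models).size;

  // Map each issue to the set of models that reported it.
  const reporters = new Map<string, Set<string>>();
  for (const f of findings) {
    const seen = reporters.get(f.issueKey) ?? new Set<string>();
    seen.add(f.model);
    reporters.set(f.issueKey, seen);
  }

  const result: Synthesis = { consensus: [], divergent: [] };
  reporters.forEach((seen, issueKey) => {
    if (seen.size === modelCount) {
      result.consensus.push(issueKey);
    } else {
      result.divergent.push(issueKey);
    }
  });
  return result;
}

const summary = synthesize(
  ["model-a", "model-b"],
  [
    { model: "model-a", issueKey: "sql-injection:userRepo.ts" },
    { model: "model-b", issueKey: "sql-injection:userRepo.ts" },
    { model: "model-a", issueKey: "race-condition:queueWorker.ts" },
  ]
);
console.log(summary.consensus); // [ 'sql-injection:userRepo.ts' ]
console.log(summary.divergent); // [ 'race-condition:queueWorker.ts' ]
```

The `divergent` list is the human's checklist: each entry is something at least one model considered a problem and at least one other did not mention.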

AI Handler also tracks the context each model used to generate its review, so when a finding is based on an assumption, that assumption is surfaced — not buried. If the model assumed your input was sanitized, you see that assumption explicitly. You can then confirm or override it and get a revised review.
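
As a rough picture of what surfacing an assumption could look like as data, consider the sketch below. The types and function names are hypothetical and are not taken from AI Handler. The idea is only that each finding carries its assumptions explicitly, and overriding one marks the finding for another pass.

```typescript
type AssumptionStatus = "unconfirmed" | "confirmed" | "overridden";

interface Assumption {
  id: string;
  statement: string; // e.g. "The value never comes from user input"
  status: AssumptionStatus;
}

interface ReviewFinding {
  summary: string;
  assumptions: Assumption[];
}

// Lists every assumption that no human has confirmed or overridden yet.
function unconfirmedAssumptions(findings: ReviewFinding[]): Assumption[] {
  const pending: Assumption[] = [];
  for (const finding of findings) {
    for (const assumption of finding.assumptions) {
      if (assumption.status === "unconfirmed") pending.push(assumption);
    }
  }
  return pending;
}

// Returns a copy of the finding with one assumption's status updated.
function resolveAssumption(
  finding: ReviewFinding,
  id: string,
  status: AssumptionStatus
): ReviewFinding {
  return {
    ...finding,
    assumptions: finding.assumptions.map((a) =>
      a.id === id ? { ...a, status } : a
    ),
  };
}

// A finding should be reviewed again once any of its assumptions is overridden.
function needsRereview(finding: ReviewFinding): boolean {
  return finding.assumptions.some((a) => a.status === "overridden");
}

const blessed: ReviewFinding = {
  summary: "Query construction looks safe",
  assumptions: [
    { id: "a1", statement: "The value never comes from user input", status: "unconfirmed" },
  ],
};

console.log(unconfirmedAssumptions([blessed]).length); // 1
const corrected = resolveAssumption(blessed, "a1", "overridden");
console.log(needsRereview(corrected)); // true
```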

The other piece is workflow integration. The problem with ChatGPT code review is not just the model — it's that the review lives in a chat window, disconnected from your PR, your CI pipeline, and your incident history. AI Handler connects those. A finding that matches a pattern from a past incident gets flagged with that incident as context. That's institutional memory, not just static analysis.

I'm not claiming AI Handler solves every problem I described above. I'm claiming it's built by someone who got burned by all of them and is taking it seriously.

AI Handler is the unified AI workflow tool I am building. Launching June 2026. Email ceo@eternalsix.com for beta access.
