Case Study #3
Passing tests didn't mean it was safe to ship.
74 files, 10 end-to-end tests, all passing. Then two independent reviewers analysed the same code in parallel. Between testing and review, 17 issues surfaced that the builder missed, and 14 of them sat outside what the test suite could see.
10/10
Tests passing
14
Issues found after tests passed
50%
Unique per reviewer
<$0.10
Review cost
Specific engine names and their skill-specific combinations are proprietary to MegaLens. Reviewers are referred to by role to protect our intellectual property.
What testing caught, narrowly
During E2E testing, 3 bugs surfaced that unit tests would not have caught.
A dependency had changed its interface.
The code was correct for last month's version. This month's version rejected the same inputs silently — no error, just empty output that looked like a valid response.
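The "empty output that looked valid" failure mode can be made concrete with a minimal sketch. Everything here is hypothetical — `client`, its `search` method, and the wrapper name are stand-ins, since the case study does not disclose the actual dependency or API:

```python
def fetch_records(client, query):
    """Wrap a dependency call so 'empty success' is surfaced, not swallowed.

    Hypothetical sketch: `client.search` stands in for the dependency whose
    new version silently rejected inputs the old version accepted.
    """
    result = client.search(query)
    # After the interface change, rejected inputs came back as an empty
    # list with no error. Treat empty output as suspicious and raise,
    # instead of passing a valid-looking empty response downstream.
    if not result:
        raise RuntimeError(
            f"dependency returned empty output for query {query!r}; "
            "possible interface change"
        )
    return result
```

The point of the guard is that a version bump which changes rejection behaviour now fails loudly at the call site rather than as quietly wrong answers downstream.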
A data format assumption was wrong.
The parser expected plain text. The source now returned structured data. Results came back empty. No crash. No warning. Just quietly wrong answers downstream.
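A hedged sketch of this bug class, assuming a JSON payload and a `records` field purely for illustration (the real formats are not disclosed): a parser that recognises both shapes, and refuses to return an empty result silently.

```python
import json

def parse_source(payload):
    """Parse a source payload, handling both plain text and structured data.

    Hypothetical sketch. The original bug: structured payloads were fed to
    a plain-text parser, which matched nothing and returned [] quietly.
    """
    try:
        data = json.loads(payload)
    except (ValueError, TypeError):
        data = None
    if isinstance(data, dict):
        # Structured input: read records explicitly instead of text-scanning.
        return data.get("records", [])
    # Plain text: one record per non-empty line.
    lines = [line.strip() for line in payload.splitlines() if line.strip()]
    if not lines:
        # Fail loudly rather than producing quietly wrong downstream answers.
        raise ValueError("source payload produced no records; format may have changed")
    return lines
```

The design choice is the same as above: when an input assumption breaks, the error surfaces at the parsing boundary instead of as empty results downstream.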
A filtering rule was applied at the wrong layer.
The system correctly identified which components to exclude, but applied the exclusion after selection instead of before it. The wrong component was chosen as primary, then skipped at runtime.
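The ordering bug is easy to show in miniature. This is a hypothetical reconstruction — component structure, field names, and the priority rule are all assumptions, not the actual system:

```python
def choose_primary(components, excluded):
    """Pick the primary component, applying exclusions BEFORE selection.

    Hypothetical sketch of the ordering bug. The buggy version, for contrast:
        primary = max(components, key=priority)  # may pick an excluded one
        # ...exclusion applied later, at runtime -> primary gets skipped
    """
    # Correct order: filter first, then select from what remains.
    eligible = [c for c in components if c["name"] not in excluded]
    if not eligible:
        raise ValueError("no eligible component after exclusions")
    return max(eligible, key=lambda c: c["priority"])
```

With exclusion applied first, the selected primary is always one that will actually run; the buggy ordering could crown a component that was then skipped.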
All 3 were the kind of issue that can clear a clean-looking build and still break when real traffic hits the odd paths.
What review found after the tests passed
Two reviewers ran in parallel against the full codebase. 14 additional issues. Zero overlap with the 3 issues testing caught.
| Category | Count | Why testing missed it |
|---|---|---|
| Concurrency | 2 | Only triggers under specific timing — process exit during timeout window |
| Input validation gaps | 3 | Enforced in some code paths but not others |
| Silent failures | 3 | Functions returned empty success instead of errors |
| Logic errors | 4 | Correct intent, wrong implementation — output degraded, not broken |
| Credential exposure | 2 | Error messages could leak sensitive data in specific failure modes |
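The input-validation category deserves a concrete shape. A hedged, hypothetical sketch (function and field names are invented for illustration): when each entry point validates on its own, one path inevitably skips a check; routing every path through a single validator is what closes the gap.

```python
def validate_request(payload):
    """Single enforcement point: every entry path must call this.

    Hypothetical sketch; the real checks are not disclosed in the case study.
    """
    if not isinstance(payload, dict):
        raise TypeError("payload must be a mapping")
    if not payload.get("id"):
        raise ValueError("missing required field: id")
    return payload

def process(payload):
    return {"ok": True, "id": payload["id"]}

def handle_api(payload):
    return process(validate_request(payload))

def handle_batch(payloads):
    # Before consolidation, a path like this one skipped validation entirely.
    return [process(validate_request(p)) for p in payloads]
```

The pattern matches the fix described later in this case study: one enforcement point instead of checks scattered across call sites.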
Cross-validation
7
Both reviewers flagged independently
4
Only Reviewer 1 caught
3
Only Reviewer 2 caught
Half the findings required a second perspective. Using only one reviewer would have missed 3–4 issues, including one concurrency bug and one validation gap.
What we fixed
14 of 17 total issues fixed in the same session. 3 deferred with documented risk acceptance (non-critical, mitigated by other controls).
- Concurrency guards added where timing-dependent failures were possible
- Input validation consolidated to a single enforcement point (was scattered across 3 locations)
- Empty-response detection added to prevent silent downstream failures
- Error output truncated and filtered to prevent credential data in logs
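The last fix above lends itself to a sketch. A minimal, hypothetical version of truncating and filtering error output before it reaches logs — the redaction patterns and length limit here are assumptions, not the real rules:

```python
import re

# Illustrative patterns only; the actual filter rules are not disclosed.
_SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[=:]\s*\S+"),
]
_MAX_LEN = 500  # assumed cap on logged error length

def sanitize_error(message):
    """Redact credential-shaped values, then truncate, before logging."""
    for pattern in _SECRET_PATTERNS:
        # Keep the key name for debuggability; drop the value.
        message = pattern.sub(r"\1=[REDACTED]", message)
    # Truncation limits how much raw payload a verbose failure can leak.
    return message[:_MAX_LEN]
```

Redaction before truncation matters: truncating first could leave a partial secret intact just inside the cutoff.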
No architectural changes were required. The design held up. The implementation needed tightening.
The Numbers
10
Tests designed and passed
3
Issues found during testing
14
Issues found by independent review
7 (50%)
Both reviewers agreed on
7 (50%)
Only one reviewer caught
14/17
Fixed same session
Review cost covers only the AI inference charges. It does not include development time, local compute, or the build phase. We flag this because <$0.10 is a memorable number and we don't want it to do more work than it should.
What this changed our mind about
Not that AI review is perfect, and not that it replaces human judgment. The useful claim here is narrower:
Testing and review catch fundamentally different defect classes.
Testing catches integration failures and broken paths. Independent review catches concurrency issues, validation inconsistencies, and silent-failure patterns that produce correct-looking but wrong output.
A single reviewer has blind spots.
The 50% unique-finding rate between the two reviewers is the number that matters most. It means one reviewer, no matter how capable, will miss things a second independent reviewer can catch. That is not a theory here. It is what we measured on our own code.
Limitations
This was one feature, one session, and two reviewers. The findings are real, but the sample is small. We do not claim these ratios carry cleanly across all codebases.
Some of the 14 “issues” were hardening opportunities rather than exploitable vulnerabilities. We counted them because they reduced real risk, but a stricter triage might score 9–10 as actionable and 4–5 as advisory.
AI reviewers also generate false positives and miss things a human reviewer would catch from domain context. This process is additive. It does not replace human review.
A meta note: This particular feature was the engine that powers MegaLens's own multi-reviewer workflow. The product reviewed its own code. We mention this not as a marketing flourish, but because it was the most honest test available: if the methodology cannot improve the code that implements the methodology, it does not work. It did.
Implementation details, tool names, and architecture specifics are intentionally omitted. We share the process and the numbers, not the blueprint.
Try MegaLens Free