Case Study #4
The SSRF fix that passed its tests and was still unsafe to ship.
A production email-verification SaaS had 7 security findings from an internal audit. One engineer, one session, one coordinated deploy. The first-pass fixes looked competent and complete. Independent review caught two bugs hiding inside the patches themselves.
7
Findings patched
4
Files touched
2
Review rounds
Same day
Deploy
Independent reviewers for this session: Gemini (Google) and Codex (OpenAI). Specific exploit payloads, endpoint paths, and credential values are omitted. We share the process and the numbers, not the blueprint.
Round 0: the fixes that looked finished
The engineer worked through the findings list the way most good engineers would. Suspended users were blocked at the database layer. The bulk verification flow moved from check-then-deduct to pre-debit plus refund. Hardcoded secret fallbacks were removed and replaced with fail-fast environment checks. Upload handlers stopped trusting client-reported file size and switched to bounded reads.
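Two of those fixes have simple canonical shapes. The sketch below is illustrative only (the function names, limit value, and environment-variable handling are assumptions, not the product's actual code): a fail-fast environment check with no hardcoded fallback, and a bounded read that ignores the client-reported size.

```python
import io
import os

def require_env(name: str) -> str:
    # Fail fast: raise at startup instead of silently using a
    # hardcoded fallback secret when the variable is missing.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # illustrative limit

def read_bounded(stream: io.BufferedIOBase, limit: int = MAX_UPLOAD_BYTES) -> bytes:
    # Bounded read: never trust a client-reported Content-Length.
    # Read one byte past the limit so oversized bodies are detectable.
    data = stream.read(limit + 1)
    if len(data) > limit:
        raise ValueError("upload exceeds size limit")
    return data
```

The one-extra-byte read is the standard trick: it distinguishes "exactly at the limit" from "over the limit" without buffering an unbounded body.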
The SSRF fix also looked clean on paper. The product verifies email domains by resolving mail servers and opening SMTP connections to them. The first-pass patch added a public-IP filter before any outbound connect. It rejected private, loopback, link-local, multicast, reserved, and unspecified ranges using Python's ipaddress module.
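A first-pass filter in that style might look like this (a hedged reconstruction of the shape, not the product's code):

```python
import ipaddress

def looks_public(ip_text: str) -> bool:
    # First-pass shape: trust ipaddress's view of the literal as written.
    addr = ipaddress.ip_address(ip_text)
    return not (
        addr.is_private or addr.is_loopback or addr.is_link_local
        or addr.is_multicast or addr.is_reserved or addr.is_unspecified
    )
```

For ordinary IPv4 and IPv6 literals this behaves exactly as the tests expect, which is why it looked finished.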
It passed its tests.
That still wasn't enough to ship.
Round 1: Gemini catches the bugs behind the fixes
Gemini reviewed the full diff with the original findings and surrounding code paths in view. It found two problems that mattered precisely because the fixes looked correct.
Catch #1 — Dead code introduced by a correct-looking patch
The suspended-account patch accidentally made a 403 response unreachable.
The initial change tightened the user lookup so disabled accounts were filtered out early. That closes the obvious hole. But it also changed behavior somewhere else in the product: an existing branch that was supposed to return a specific “your account is suspended” response became permanently unreachable.
A suspended user with a still-valid token would now hit a generic unauthorized path instead of the intended suspended-account path.
That is exactly the kind of bug that slips past review because the patched query looks safer. Gemini caught it because it read the caller, not just the patch. The final design became asymmetric on purpose: strict where fresh authentication happens, lenient enough where session validation needs to preserve the suspension-specific response.
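That asymmetry can be sketched in a few lines. This is a toy model with hypothetical names and an in-memory user store, not the product's code; it only shows why the session path must stay lenient enough to see the suspended record:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    id: int
    is_suspended: bool

_USERS = {1: User(1, False), 2: User(2, True)}  # toy store

def lookup_for_login(user_id: int) -> Optional[User]:
    # Strict: fresh authentication never returns a suspended account.
    user = _USERS.get(user_id)
    return None if user is None or user.is_suspended else user

def validate_session(user_id: int) -> tuple[int, str]:
    # Lenient lookup, so the caller can still distinguish suspension
    # from a plain invalid session instead of dead-coding the 403.
    user = _USERS.get(user_id)
    if user is None:
        return 401, "unauthorized"
    if user.is_suspended:
        return 403, "account suspended"
    return 200, "ok"
```

If `validate_session` reused the strict lookup, the suspended branch could never fire: every suspended user would collapse into the generic 401, which is the dead-code bug in miniature.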
Catch #2 — SSRF filter bypassable via IPv6 tunnel forms
A cloud-metadata endpoint wrapped inside an IPv6 transition address slipped through.
The first patch checked whether a resolved address looked globally routable. That works for ordinary IPv4 and IPv6 literals. It does not work when the address is an IPv6 transition form carrying an embedded IPv4 target inside it. In those cases, Python's ipaddress checks evaluate the outer IPv6 object unless you explicitly unwrap the embedded address first.
So the fix blocked the obvious private targets, but a cloud metadata endpoint wrapped inside an IPv4-mapped or 6to4 transition address could still pass the “public” test and slip through.
Gemini flagged it immediately. The patch was rewritten to unwrap IPv4-mapped, 6to4, and Teredo-style tunnel forms before evaluating whether the destination was public. After that rewrite, fourteen test cases passed, including the exact tunneled cloud-metadata vector that the first version missed.
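The unwrapping step the rewrite added can be sketched like this, using Python's ipaddress module (function names are illustrative, not the product's code):

```python
import ipaddress

def unwrap(addr):
    """Peel IPv6 transition forms down to the embedded IPv4 target, if any."""
    if isinstance(addr, ipaddress.IPv6Address):
        if addr.ipv4_mapped is not None:   # ::ffff:a.b.c.d
            return addr.ipv4_mapped
        if addr.sixtofour is not None:     # 2002::/16, IPv4 in bits 17-48
            return addr.sixtofour
        if addr.teredo is not None:        # 2001::/32, (server, client) pair
            return addr.teredo[1]          # the tunneled client address
    return addr

def is_public_destination(ip_text: str) -> bool:
    # Validate the *embedded* address, not the outer wrapper.
    addr = unwrap(ipaddress.ip_address(ip_text))
    return not (
        addr.is_private or addr.is_loopback or addr.is_link_local
        or addr.is_multicast or addr.is_reserved or addr.is_unspecified
    )
```

The key property: a link-local metadata target wrapped as `::ffff:…` or inside a 6to4 prefix is judged by its embedded IPv4 form, so it fails the public test exactly like the bare IPv4 literal does.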
This would have shipped without a second brain.
The important part: the Council was not a yes-man
The value here was not just “more review.” It was adversarial review.
Gemini also pushed on two points the engineer did not accept.
DNS rebinding on the MX lookup path.
Gemini pushed harder on this residual risk. The engineer kept it as a documented Phase 2 issue rather than a release blocker for this patch set, because fixing it properly means pinning resolved IPs through the later connection path instead of pretending one more filter closes the window.
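The direction that Phase 2 fix points at is roughly: resolve once, validate the literal IP, then connect to that exact IP instead of re-resolving the hostname. A minimal sketch under those assumptions (helper name hypothetical; real SMTP code would also carry the original hostname forward for EHLO and TLS verification):

```python
import ipaddress
import socket

def resolve_and_pin(hostname: str) -> str:
    # Resolve exactly once; every later connect must use this literal IP,
    # so a rebinding DNS server cannot swap in a private target between
    # the validation step and the connection step.
    infos = socket.getaddrinfo(hostname, 25, proto=socket.IPPROTO_TCP)
    for *_, sockaddr in infos:
        addr = ipaddress.ip_address(sockaddr[0])
        if addr.is_global:
            return str(addr)
    raise ValueError(f"no public address for {hostname}")

# later: smtplib.SMTP(resolve_and_pin(mx_host), 25)  # connect to the pinned IP
```

Without pinning, even a perfect address filter only validates the answer the resolver gave *at check time*; the connect that follows can get a different one.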
Fail-fast-at-import for shared-module env checks.
Gemini pressed on the startup behavior. The engineer held the line on the chosen boundary for this deploy, with production environment injection handled deliberately before restart.
This matters because “multi-model” only works if disagreement is allowed to stay alive long enough to be tested.
Codex then reviewed the disputed points independently and landed on the engineer's side on both. Round 2 ended with a dual SHIP verdict.
That is the part people tend to miss when they hear “council.” The goal is not to pile up agreement. The goal is to create a structure where one reviewer catches what the fixer missed, another reviewer can challenge that review, and a third perspective can arbitrate without being anchored to either side.
Round 2: Ship
By the end of the session, the blocker and high-severity code fixes shipped in one coordinated deploy. The concrete shape of the work was small and surgical.
3
Blocker findings
4
High-severity findings
4
Files touched
14
Tunnel-aware test cases
2
Review rounds
None
Production incidents
These were live security patches on a public SaaS with billing logic, authentication, and outbound network behavior in the blast radius. Production deploy happened the same day.
What this session actually showed
Single-model review has a blind spot that is easy to underestimate.
A model that writes a patch is usually anchored to the vulnerability it is trying to close. It checks, “Did I block the thing?” It is much worse at asking, “What did this change quietly break one layer up?” or “What weird representation still passes my validation logic even though the ordinary form does not?”
That is why the two best catches in this session were both second-order problems: not old bugs left untouched, but bugs introduced or preserved by remediation that looked correct on the surface.
One brain can produce a plausible patch.
The first-pass fixes looked like what a senior engineer would sketch on a whiteboard. They were not careless. They were anchored.
A second brain catches the non-obvious mistake.
Gemini read the calling code, not just the patch, and the representation space, not just the happy path. That is the gap-fill mode peer review is actually good at.
A third brain resolves whether the second brain is right.
Codex independently adjudicated the two points where the engineer disagreed with Gemini. Council only works if the third voice can break ties without being anchored to either side.
Limitations
This was one product, one engineer, one session, four touched files, and one coordinated deploy. Not a benchmark. A real remediation sprint on a live system.
This does not replace static analysis, dependency scanning, or human security review. Those tools catch different failure modes. The lesson here is narrower and more useful: independent council review is unusually good at catching fix-on-fix failures, dead-code security controls, and edge-case bypasses that survive both implementation and ordinary review.
AI reviewers also produce false positives and miss things a human reviewer would catch from domain context. This process supplements human review. It does not replace it.
Why this matters for MegaLens: This is the MegaLens thesis in miniature. The product runs multiple debaters plus peer judges plus a supreme reviewer on every audit for exactly this reason. Single-model output is a plausible first draft. Council review is the mechanism that catches the class of mistakes that survive single-brain review, including mistakes introduced by the fix itself.
In this case, that difference was the gap between “patched” and “safe to ship.”
Try MegaLens Free