
Case Study #2

23 issues found before writing a single line of code.

We had a 7-step UI plan for a complex web app, and on first read it looked solid. Before building, we sent it to two independent reviewers. They found 23 issues across security, architecture, UX, and performance, and the plan changed materially before implementation started.

23 issues found

3 security risks

15 plan modifications

77% agreement rate

Specific engine names and their skill-specific combinations are proprietary to MegaLens. Reviewers are referred to by role to protect our intellectual property.

Reviewer 1: the architecture gaps

Reviewer 1 read the plan as a senior frontend architect would and found 10 structural gaps.

• No data model or state management strategy defined
• No loading, error, or retry behaviour specified
• No rules for what persists across sessions vs. what's ephemeral
• No URL routing strategy for conversation history
• No accessibility considerations
• No strategy for rendering rich content safely
• No mechanism to cancel or stop in-progress generation
• No way to retry individual components of a multi-engine response
• No empty states or first-use onboarding
• No frontend test strategy

The pattern: The plan covered screens and layouts, but not enough of the operating model. It answered “what should this look like” better than “what happens when this gets messy.”

Reviewer 2: security, UX, and production risk

Reviewer 2 took a broader brief and came back with 13 issues across four categories.

Security (3)

• Client-side credential storage exposed to cross-site scripting
• Rich content rendering could execute injected scripts
• Credential validation requests at scale could trigger provider rate limits

UX (4)

• Requiring credentials before first interaction kills conversion
• Simulated typing effects frustrate experienced users
• Cost estimates for multi-engine queries can't be accurate, and showing false precision erodes trust
• Multi-panel comparison views don't work on mobile screens

Performance (3)

• Expandable detail views create excessive DOM nodes at scale
• Shared application bundle on the marketing page hurts load time and SEO
• Credential validation on every paste creates unnecessary API load

Production (3)

• Platform execution time limits conflict with multi-engine debate duration
• Long-lived streaming connections don't scale without connection management
• File-based context (upload, paste) is missing entirely, which limits usefulness for real work

Cross-examination: where they diverged

Reviewer 2's 13 criticisms were sent to Reviewer 1 for validation.

Category	Agreed	Disagreed	Partial
Security	3/3	0	0
UX	4/4	0	0
Performance	1/3	0	2
Production	1/3	2	0

10 of 13 confirmed (77%). The 3 disagreements were about timing, not validity. Reviewer 2 was right that the risks were real. Reviewer 1 was right that they were not launch blockers.

How the plan changed

The original 7-step plan picked up 15 modifications before any code was written.

Security changes

• Credentials stored in runtime memory only, never in browser storage
• Rich content sanitised before rendering, with no raw HTML execution
• Credential validation deferred to first use, not on input
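
The three security changes can be illustrated with a minimal TypeScript sketch. It is not the MegaLens implementation, which this case study intentionally omits. The names `InMemoryCredential`, `CredentialValidator`, `getValidated`, and `escapeHtml` are hypothetical. Plain escaping stands in for sanitisation here as the simplest safe default; real rich content would more likely need an allowlist-based sanitiser.

```typescript
// Hypothetical validator: resolves true if the provider accepts the credential.
type CredentialValidator = (secret: string) => Promise<boolean>;

// Holds a credential in a private field only. Nothing is written to
// localStorage, sessionStorage, or cookies, so a page reload discards it.
class InMemoryCredential {
  private secret: string | null = null;
  private validation: Promise<boolean> | null = null;
  private readonly validator: CredentialValidator;

  constructor(validator: CredentialValidator) {
    this.validator = validator;
  }

  // Called on paste or input: stores the value and makes no network call.
  set(secret: string): void {
    this.secret = secret;
    this.validation = null; // a new value invalidates any cached result
  }

  clear(): void {
    this.secret = null;
    this.validation = null;
  }

  // Validation is deferred to first use and runs at most once per stored value.
  async getValidated(): Promise<string> {
    if (this.secret === null) throw new Error("No credential set");
    const secret = this.secret;
    if (this.validation === null) this.validation = this.validator(secret);
    const ok = await this.validation;
    if (!ok) throw new Error("Credential rejected by provider");
    return secret;
  }
}

// Escapes untrusted text so it renders as text rather than as markup.
function escapeHtml(untrusted: string): string {
  return untrusted
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

Keeping the credential in memory removes it from persistent browser storage, but it does not make the page immune to injected scripts, which is why the rendering change sits alongside it.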

UX changes

• Credential input moved from blocking modal to inline prompt
• Simulated typing removed in favour of real streaming progress only
• Cost display changed from exact estimates to honest ranges
• Mobile layout changed from side-by-side panels to stacked cards
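
One way to read "honest ranges" is to sum per-engine lower and upper bounds and round outward, so the display never claims more precision than the inputs support. The sketch below assumes that reading. `EngineEstimate`, `formatCostRange`, and the micro-dollar unit are hypothetical choices made for illustration, not details from the original plan.

```typescript
// Hypothetical per-engine estimate in integer micro-dollars (1e-6 USD),
// so that sums stay exact and avoid floating-point drift.
interface EngineEstimate {
  engine: string;
  lowMicros: number;
  highMicros: number;
}

const MICROS_PER_CENT = 10000;

function formatCents(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`;
}

// Sums the bounds across engines, then rounds outward to whole cents:
// the low end rounds down and the high end rounds up.
function formatCostRange(estimates: EngineEstimate[]): string {
  let low = 0;
  let high = 0;
  for (const e of estimates) {
    if (e.lowMicros > e.highMicros) {
      throw new Error(`Inverted bounds for ${e.engine}`);
    }
    low += e.lowMicros;
    high += e.highMicros;
  }
  const lowCents = Math.floor(low / MICROS_PER_CENT);
  const highCents = Math.ceil(high / MICROS_PER_CENT);
  if (lowCents === highCents) return `about ${formatCents(lowCents)}`;
  return `${formatCents(lowCents)} to ${formatCents(highCents)}`;
}

const label = formatCostRange([
  { engine: "engine-a", lowMicros: 12500, highMicros: 48000 },
  { engine: "engine-b", lowMicros: 30000, highMicros: 61200 },
]);
// label === "$0.04 to $0.11"
```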

Architecture additions (not in original plan)

• Typed event schema for streaming responses
• Cancel/stop mechanism for in-progress generation
• Per-component retry capability
• First-use onboarding and empty states
• Partial failure display (when some engines succeed and others don't)
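
Several of these additions fit together naturally: a typed event schema gives cancel, per-component retry, and partial failure a shared vocabulary. The following is a sketch under assumptions, not the actual schema. The event names, `ResponseState`, `applyEvent`, and `isPartialFailure` are all hypothetical. It models events as a TypeScript discriminated union and folds them into per-engine state with a pure reducer.

```typescript
type EngineId = string;

// Hypothetical event schema: every streamed message is one of these shapes.
type StreamEvent =
  | { type: "engine_started"; engine: EngineId }
  | { type: "token"; engine: EngineId; text: string }
  | { type: "engine_completed"; engine: EngineId }
  | { type: "engine_failed"; engine: EngineId; message: string; retryable: boolean }
  | { type: "cancelled" };

type EngineState =
  | { status: "streaming"; text: string }
  | { status: "done"; text: string }
  | { status: "failed"; text: string; message: string; retryable: boolean }
  | { status: "cancelled"; text: string };

type ResponseState = Record<EngineId, EngineState>;

function withEngine(state: ResponseState, engine: EngineId, next: EngineState): ResponseState {
  return { ...state, [engine]: next };
}

// Pure reducer: returns a new state and never mutates the input.
function applyEvent(state: ResponseState, event: StreamEvent): ResponseState {
  switch (event.type) {
    case "engine_started":
      // Doubles as per-component retry: restarting one engine resets only its entry.
      return withEngine(state, event.engine, { status: "streaming", text: "" });
    case "token": {
      const current = state[event.engine];
      // Tokens arriving after completion, failure, or cancel are ignored.
      if (!current || current.status !== "streaming") return state;
      return withEngine(state, event.engine, { status: "streaming", text: current.text + event.text });
    }
    case "engine_completed": {
      const current = state[event.engine];
      if (!current || current.status !== "streaming") return state;
      return withEngine(state, event.engine, { status: "done", text: current.text });
    }
    case "engine_failed": {
      const text = state[event.engine]?.text ?? "";
      return withEngine(state, event.engine, {
        status: "failed", text, message: event.message, retryable: event.retryable,
      });
    }
    case "cancelled": {
      // Stop only engines still streaming; finished output is kept.
      const next: ResponseState = {};
      for (const [engine, s] of Object.entries(state)) {
        next[engine] = s.status === "streaming" ? { status: "cancelled", text: s.text } : s;
      }
      return next;
    }
  }
}

// Partial failure: at least one engine succeeded and at least one failed.
function isPartialFailure(state: ResponseState): boolean {
  const states = Object.values(state);
  return states.some((s) => s.status === "failed") && states.some((s) => s.status === "done");
}
```

Because the reducer is pure, a UI layer could replay the same event sequence in tests without a live streaming connection.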

Deferred intentionally (V2)

• Connection pooling at scale
• File upload and document context
• Full DOM virtualisation

The Numbers

23 issues found before implementation

3 security risks

4 UX anti-patterns

10 architecture gaps

3 performance risks

3 production risks

77% cross-examination agreement rate

3 items deferred after disagreement

15 plan modifications applied

Why this mattered before build

Plan review catches a different class of mistake than code review.

Code review finds implementation bugs. Plan review finds missing capabilities, weak assumptions, and architectural decisions that become expensive to reverse once code exists.

Independent reviewers look at the same plan and see different failures.

Reviewer 1 found structural omissions. Reviewer 2 found security, UX, and production risk. There was very little overlap, which is exactly the point.

Disagreement is useful input, not noise.

The 3 disagreements were prioritisation debates. Both reviewers agreed the risks were real; they disagreed about when to deal with them. That is better decision support than a clean but shallow consensus.

Reviewing before implementation is cheaper than rebuilding after it.

Every one of these 23 issues would have been more expensive to fix after code existed. The credential storage change alone, from browser storage to runtime memory, would have touched every component that handles credentials.

Limitations

This was one UI plan reviewed by two AI reviewers. The issues were real and the plan changes were applied, but we are not treating the ratios here as universal.

Some of the 23 “issues” were omissions, such as no cancel button, rather than defects in existing code. We counted them because they were real gaps that would have shipped, but a stricter definition of “issue” would lower the total.

AI reviewers can over-index on theoretical risk. The connection-scaling and file-upload critiques were technically valid but practically early. Human judgment still decided what mattered for launch and what belonged in V2.

Implementation details, tool names, and architecture specifics are intentionally omitted. We share the process and the numbers, not the blueprint.
