How we built a 6-model AI debate engine

Polymind · 2026-05-24 · 3 min read · engineering · multi-model · product

Polymind looks like a chat app, but the product shape underneath is closer to a small distributed system. One user prompt fans out to several model providers, answers stream back at different speeds, optional critique rounds create a shared context, and a judge model synthesizes the final response.

The core idea is simple. The implementation is mostly about keeping that idea honest under latency, failures, quotas, and provider differences.

The product loop

Every Polymind run follows the same broad sequence.

1. The user writes one prompt.
2. Polymind sends it to multiple panelist models in parallel.
3. If critique is enabled, each panelist sees the others' answers and revises.
4. A judge model reads the final answers and writes a synthesis.
5. Public runs feed the leaderboard through judge picks.

This shape gives the user two outputs: the final synthesis and the spread of model opinions that led to it.

Fan-out first

The backend is a FastAPI service. Provider API keys live server-side, not in the browser. The debate endpoint fans out to provider adapters so Anthropic, OpenAI, Gemini, Perplexity, Grok, Mistral, Qwen, and future providers can be treated through a common interface.

The important design choice is parallelism. Asking six providers one after another would make a full panel painfully slow, because the user would wait for the sum of every provider's latency. Fan-out turns the wall-clock time into "roughly the slowest provider" rather than "sum of all providers."

Provider failures are expected. A multi-model product should degrade usefully when one provider stalls or errors. The run can still show the answers that arrived, mark the failed panelist, and continue as long as the remaining answers still give the user value.
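
A minimal sketch of that fan-out, written with plain asyncio. It is an illustration under assumptions, not Polymind's actual code: the `Adapter` signature, `PanelistResult`, `ask_one`, `fan_out`, and the timeout value are all hypothetical names chosen for the example. Each provider call is wrapped so that a timeout or exception becomes a marked failure instead of sinking the whole run.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional

# Hypothetical adapter signature: takes a prompt, returns the answer text.
Adapter = Callable[[str], Awaitable[str]]


@dataclass
class PanelistResult:
    name: str
    answer: Optional[str] = None
    error: Optional[str] = None  # set when the panelist failed or timed out


async def ask_one(name: str, adapter: Adapter, prompt: str,
                  timeout_s: float) -> PanelistResult:
    """Call one provider and turn a stall or error into a marked failure."""
    try:
        answer = await asyncio.wait_for(adapter(prompt), timeout=timeout_s)
        return PanelistResult(name, answer=answer)
    except asyncio.TimeoutError:
        return PanelistResult(name, error="timeout")
    except Exception as exc:  # provider errors are expected, so catch broadly
        return PanelistResult(name, error=f"{type(exc).__name__}: {exc}")


async def fan_out(adapters: dict[str, Adapter], prompt: str,
                  timeout_s: float = 30.0) -> list[PanelistResult]:
    """Query every panelist concurrently; results keep the adapters' order."""
    calls = [ask_one(name, adapter, prompt, timeout_s)
             for name, adapter in adapters.items()]
    # ask_one never raises, so gather returns one result per panelist.
    return await asyncio.gather(*calls)
```

Because every call runs concurrently, the total wait is bounded by the slowest panelist or by the timeout, whichever comes first.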

Streaming is the user experience

Long AI calls feel broken if the screen is silent. Polymind streams events over SSE so the frontend can show panelists as they begin, stream text as it arrives, and keep the run legible while work is in flight.

This matters more in a multi-model system than in a single chatbot. Different providers have different first-token latency, throughput, and failure behavior. Streaming turns that variance into visible progress instead of a spinner.

It also makes error boundaries clearer. A panelist can finish, stall, or fail independently. The UI should reflect that granularity.
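
One way to get that per-panelist granularity is to merge several provider streams into a single SSE stream through a queue. The sketch below assumes a hypothetical streaming adapter that yields text chunks; the event names (`panelist_start`, `delta`, `panelist_done`, `panelist_error`) and the `merged_stream` helper are invented for illustration. Only the frame layout (an `event:` line, a `data:` line, then a blank line) comes from the SSE format itself.

```python
import asyncio
import json
from typing import AsyncIterator, Callable

# Hypothetical streaming adapter: takes a prompt, yields text chunks.
StreamAdapter = Callable[[str], AsyncIterator[str]]


def sse(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame: event line, data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def merged_stream(adapters: dict[str, StreamAdapter],
                        prompt: str) -> AsyncIterator[str]:
    """Interleave frames from all panelists as they arrive."""
    queue: asyncio.Queue = asyncio.Queue()

    async def pump(name: str, adapter: StreamAdapter) -> None:
        await queue.put(sse("panelist_start", {"panelist": name}))
        try:
            async for chunk in adapter(prompt):
                await queue.put(sse("delta", {"panelist": name, "text": chunk}))
            await queue.put(sse("panelist_done", {"panelist": name}))
        except Exception as exc:
            # One panelist failing is reported as an event, not a crash.
            await queue.put(sse("panelist_error",
                                {"panelist": name, "error": str(exc)}))
        finally:
            queue.put_nowait(None)  # sentinel: this panelist is finished

    tasks = [asyncio.create_task(pump(n, a)) for n, a in adapters.items()]
    remaining = len(tasks)
    try:
        while remaining:
            frame = await queue.get()
            if frame is None:
                remaining -= 1
            else:
                yield frame
    finally:
        # If the client disconnects early, stop the provider streams.
        for task in tasks:
            task.cancel()
```

In a FastAPI service, a generator like this could be returned through `StreamingResponse(..., media_type="text/event-stream")`, so the frontend sees each panelist start, stream, and finish or fail on its own.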

Critique rounds

The critique round is where Polymind stops being just parallel chat. After the first answers arrive, each panelist can see the other answers and revise its own.

That creates a lightweight debate loop. Models can correct mistakes, borrow useful framing, or double down on a disagreement. The judge then sees the revised final answers, not only the first draft.

Critique is capped by tier because it multiplies cost. Depth 0 is a simple panel. Higher depth buys more deliberation at higher provider spend.
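
To make the cost multiplication concrete, here is a small sketch. It assumes each critique round calls every panelist once and that the judge is called once at the end; the tier names, the caps in `MAX_DEPTH_BY_TIER`, and the prompt wording are hypothetical, not Polymind's real values.

```python
# Hypothetical tier caps on critique depth; the real values are not published here.
MAX_DEPTH_BY_TIER = {"free": 0, "pro": 1, "max": 2}


def clamp_depth(tier: str, requested: int) -> int:
    """Limit the requested critique depth to what the tier allows."""
    return max(0, min(requested, MAX_DEPTH_BY_TIER.get(tier, 0)))


def provider_calls(panelists: int, depth: int) -> int:
    """Initial round plus `depth` critique rounds per panelist, plus one judge call."""
    return panelists * (1 + depth) + 1


def critique_prompt(question: str, own_name: str, answers: dict[str, str]) -> str:
    """Build the revision prompt: the panelist's own answer plus everyone else's."""
    others = "\n\n".join(
        f"[{name}]\n{text}" for name, text in answers.items() if name != own_name
    )
    return (
        f"Question:\n{question}\n\n"
        f"Your previous answer:\n{answers[own_name]}\n\n"
        f"Other panelists answered:\n{others}\n\n"
        "Revise your answer. Correct mistakes and keep what you still believe."
    )
```

Under these assumptions a six-model panel makes 7 provider calls at depth 0 and 19 at depth 2, which is why depth is a natural thing to gate by tier.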

The judge and leaderboard

The judge writes the final synthesis. It also names which panelists it leaned on. Those named panelists become picks, and picks are the raw material for the leaderboard.

The leaderboard does not sort by raw pick rate. It sorts by the lower bound of the Wilson score interval, so small-sample luck is punished. That is the difference between "this model got picked once" and "this model keeps getting picked."
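
The Wilson lower bound is a standard formula, so it is easy to sketch. The version below assumes a 95% interval (z = 1.96) and assumes the denominator is the number of runs a model appeared in. Both are illustrative choices, and the run shape used by `rank` is hypothetical; the methodology page is the authority on the real parameters.

```python
import math
from collections import Counter


def wilson_lower_bound(picks: int, appearances: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a pick rate."""
    if appearances == 0:
        return 0.0
    p = picks / appearances
    z2 = z * z
    centre = p + z2 / (2 * appearances)
    margin = z * math.sqrt(p * (1 - p) / appearances + z2 / (4 * appearances ** 2))
    # Clamp at zero to absorb floating-point noise when p is 0.
    return max(0.0, (centre - margin) / (1 + z2 / appearances))


def rank(runs: list[dict]) -> list[tuple[str, float]]:
    """Rank models from runs shaped like {'panelists': [...], 'picks': [...]}."""
    appearances: Counter = Counter()
    picks: Counter = Counter()
    for run in runs:
        appearances.update(run["panelists"])
        picks.update(run["picks"])
    scores = {m: wilson_lower_bound(picks[m], n) for m, n in appearances.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With these settings, a model picked in 1 of 1 runs scores about 0.21, while a model picked in 60 of 100 runs scores about 0.50. The raw pick rates would rank them the other way round.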

The methodology page documents the details because judge-based rankings need visible caveats. The judge is a model. It has preferences. The leaderboard measures those preferences inside Polymind's run shape.

Persistence and sharing

Completed runs can be saved and revisited. Public debate pages give the system a shareable artifact: prompt, panelist answers, judge synthesis, picks, and metadata.

That persistence is also what makes SEO and research surfaces possible. A public run is not just an app state; it is a citeable page with structured data, an OG image, and links into the model and leaderboard pages.

Trust boundaries

The main trust boundary is provider access. Anything that calls model APIs stays server-side. Quota enforcement, spend tallying, tier guards, and provider keys are backend responsibilities.
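
A sketch of what a server-side guard might look like, run before any provider is called. Everything here is an assumption made for illustration: the tier names, the limits in `TIER_LIMITS`, the `Usage` fields, and the `<PROVIDER>_API_KEY` environment variable convention are not taken from Polymind's code.

```python
import os
from dataclasses import dataclass


class QuotaExceeded(Exception):
    """Raised before any provider call, so a rejected run costs nothing."""


@dataclass
class Usage:
    runs_today: int
    spend_usd_today: float


# Hypothetical per-tier limits: (max runs per day, max spend per day in USD).
TIER_LIMITS = {"free": (5, 0.50), "pro": (200, 20.00)}


def check_quota(tier: str, usage: Usage, estimated_cost_usd: float) -> None:
    """Reject the run if it would exceed the tier's run or spend limit."""
    max_runs, max_spend = TIER_LIMITS.get(tier, TIER_LIMITS["free"])
    if usage.runs_today >= max_runs:
        raise QuotaExceeded("daily run limit reached")
    if usage.spend_usd_today + estimated_cost_usd > max_spend:
        raise QuotaExceeded("daily spend limit reached")


def provider_key(provider: str) -> str:
    """Read a key from the server environment; the naming scheme is assumed."""
    return os.environ[f"{provider.upper()}_API_KEY"]
```

The point of the shape is that the browser never sees a key or a limit. It only sees whether the run was accepted.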

Some user-scoped preferences can live closer to the browser through Supabase and row-level security, but model execution cannot. Splitting that boundary carefully prevents product logic from drifting into two backends.

What we learned

Multi-model AI is not about asking six models because six is a big number. It is about making agreement and disagreement visible.

The engineering follows from that. Stream every panel independently. Preserve failures without collapsing the whole run. Let critique add deliberation when it is worth the cost. Make the judge's choices inspectable. Publish the methodology so the leaderboard is a claim a reader can audit.

That is the shape of Polymind: one prompt, many answers, one synthesis, and a public trail of which models helped. The source lives in the Polymind GitHub repository if you want to inspect the implementation rather than just the shape.