
Which AI wins most often in 2026? Polymind's open leaderboard

Polymind · 2026-05-21 · 4 min read · leaderboard · methodology · multi-model

The honest answer to "which AI is best?" is "it depends on the question." The interesting answer is "here is a public scoreboard of every question we've ever run, sliced six ways, with a confidence interval." That scoreboard is the Polymind leaderboard, and this post is a tour of what it currently says.

TL;DR

Polymind asks every model on a panel of frontier LLMs the same prompt in parallel, optionally lets them critique each other, then has a chosen judge model synthesize one answer. The judge nominates the panelists it leaned on. Those nominations, called judge picks, are the leaderboard's unit of credit, and the rank uses a Wilson lower bound so that a model with 4 picks out of 4 doesn't sit above a model with 800 out of 1000.

The leaderboard is regenerated from live data; the snapshot below is a read taken on 2026-05-21 and will look different by the time you visit. The qualitative picture, though, has been stable for months.

What's actually being measured

Every Polymind run finishes the same way: a judge model — chosen by the user, defaulting to the strongest model the tier allows — reads the final-round panelist answers and writes a synthesis. As part of that synthesis it explicitly names which panelists it found most useful. That's a "pick."

Picks are useful precisely because they're post-hoc. Nobody decided in advance that Claude is good at law and GPT is good at code. The judge just sees the answers, with whatever its own biases are baked in, and points at the ones it preferred. Aggregate enough of those decisions and the per-domain picture sharpens.

The domain itself is also derived after the fact. A separate classifier labels each prompt as code, legal, medical, creative, research, or general. The leaderboard then filters picks by domain so you can read a category-specific ranking — see /leaderboard/code for the developer-facing slice, /leaderboard/legal for legal research, and so on. The methodology page has the full classifier spec.
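The bookkeeping described here can be pictured as a tally of picks and panel appearances per model, optionally restricted to one domain label. The sketch below is illustrative only: the record fields (`domain`, `panel`, `picks`), the `tally` helper, and the model names are hypothetical, not Polymind's actual schema.

```python
from collections import defaultdict

# Hypothetical run records; field names and model names are made up for illustration.
runs = [
    {"domain": "code", "panel": ["model-a", "model-b", "model-c"], "picks": ["model-a"]},
    {"domain": "code", "panel": ["model-a", "model-b"], "picks": ["model-a", "model-b"]},
    {"domain": "legal", "panel": ["model-a", "model-b", "model-c"], "picks": ["model-c"]},
    {"domain": "general", "panel": ["model-b", "model-c"], "picks": ["model-b"]},
]


def tally(runs, domain=None):
    """Return {model: (picks, appearances)}, optionally for a single domain."""
    counts = defaultdict(lambda: [0, 0])
    for run in runs:
        if domain is not None and run["domain"] != domain:
            continue
        for model in run["panel"]:
            counts[model][1] += 1  # the model was on the panel
        for model in run["picks"]:
            counts[model][0] += 1  # the judge named the model
    return {model: tuple(pair) for model, pair in counts.items()}


print(tally(runs))          # every domain rolled up
print(tally(runs, "code"))  # the code slice only
```

A mislabeled prompt changes which domain slice a pick lands in, but leaves the rolled-up tally unchanged.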

Why a Wilson lower bound, and not raw win rate

A model that has appeared on a single panel and been picked sits at a 100% raw win rate. The Wilson lower bound asks a much more useful question: "given how few samples we have, what's the most pessimistic plausible win rate for this model?" It rewards consistency and sample size, not luck.

The practical effect is that the top of the leaderboard is dominated by models that have racked up hundreds of picks across hundreds of runs. The bar for displacing them is high, and a model with 4 picks out of 4 settles in the middle of the table rather than crowning itself.

This is the same statistical move behind Reddit's "best" comment sort; we wrote it up properly on the methodology page with the formula and the worked example. Use the methodology version stamp (currently v2) when you cite a snapshot. If the math changes, the version increments.
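For a quick feel for the arithmetic, here is a minimal sketch of the standard Wilson score lower bound. The function name and the choice of z = 1.96 (roughly 95% confidence) are assumptions; this post does not state which confidence level the leaderboard uses, so treat the methodology page as the authority.

```python
import math


def wilson_lower_bound(picks: int, appearances: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for picks / appearances.

    z = 1.96 is an assumed confidence level, not a documented Polymind setting.
    """
    if appearances == 0:
        return 0.0  # no data, so nothing to rank on
    n = appearances
    p = picks / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


for picks, appearances in [(1, 1), (4, 4), (800, 1000)]:
    raw = picks / appearances
    bound = wilson_lower_bound(picks, appearances)
    print(f"{picks}/{appearances}: raw {raw:.3f}, Wilson lower bound {bound:.3f}")
```

Under these assumptions the 4-for-4 model scores about 0.51 while the 800-for-1000 model scores about 0.77, so the larger sample ranks higher despite its lower raw rate.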

How to read the leaderboard

Two screens worth flipping between:

/leaderboard/all — every domain rolled up. The right answer to "which AI wins most often?" in the most generic sense.
/leaderboard/<domain> — the slice that matters for your use case. The pick distribution looks meaningfully different per domain; a top-3 model in code can be middle-of-the-pack in creative writing.

Both pages publish their own JSON twin at …/data.json under CC-BY-4.0 if you want to graph the numbers yourself or pull them into a notebook. Same data, no scraping required.
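As a rough idea of what pulling the numbers into a notebook might look like, the sketch below parses a saved copy of the payload and sorts models by raw pick rate. Everything about the payload shape is an assumption: the keys (`methodology_version`, `rows`, `model`, `picks`, `appearances`), the sample values, and the `raw_rates` helper are hypothetical, so check the real file before relying on any of them.

```python
import json

# Hypothetical payload: the real data.json may use different keys and nesting.
SAMPLE = """
{"methodology_version": "v2",
 "rows": [
   {"model": "model-a", "picks": 40, "appearances": 50},
   {"model": "model-b", "picks": 9, "appearances": 10}
 ]}
"""


def raw_rates(text: str) -> list[tuple[str, float]]:
    """Parse a payload and return (model, raw pick rate) pairs, highest first."""
    payload = json.loads(text)
    rates = [
        (row["model"], row["picks"] / row["appearances"])
        for row in payload["rows"]
        if row["appearances"] > 0  # skip models that never appeared
    ]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


print(raw_rates(SAMPLE))
```

Raw rates are only a starting point for plotting; the published rank uses the Wilson lower bound rather than this ratio.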

The biases the leaderboard does not pretend to hide

Three biases are worth saying out loud:

Judge bias. The judge is itself an LLM with preferences. When the judge is GPT, GPT panelists win marginally more often than they should. The methodology page documents the magnitude.
Self-selection bias. Users pick which models go on the panel. That biases which models even get a chance to be picked. The leaderboard normalizes by appearances, but it can't normalize for the prompts a Claude-fan asks vs. the prompts a GPT-fan asks.
Domain drift. The classifier is a small model; it gets the easy cases right and only a majority of the hard ones. A prompt mislabeled as general when it's really legal lands its pick in the wrong bucket. The "all" page does not depend on the domain label, so it is immune to this particular error in a way no single domain page is.

We surface all three on the methodology page, in writing, so you can discount the numbers honestly.

What this isn't

The leaderboard is not a public benchmark. The questions are whatever real users ask, which is mostly the long tail of "should I…?" and "explain why…" and "what's the difference between…?" There is no GSM8K, no MMLU, no HumanEval. If you want a single number that lets you say "model X is 4 points better than model Y on grade-school math," this isn't that. If you want "of the 800 times a real person asked a frontier-model panel a real question, here's who the judge actually leaned on," this is exactly that.

How to contribute

Sign in, ask a question, hit Run Polymind. Every completed run contributes anonymously to the next leaderboard regeneration. There is no premium path to nudging the rankings — the free tier and the premium tier both feed the same pick stream.

If you want to cite a snapshot, the leaderboard page has a "Cite this leaderboard" button that generates BibTeX, APA, and Markdown with the methodology version stamped in. We use it ourselves.

The next post in this series will dig into per-domain crossover: where the rank order flips between, say, code and legal. Subscribe to the RSS feed if you want to catch it.