Polymind LLM judge-pick leaderboard (all)

Which model wins most often?

As of 2026-07-25, Claude leads the leaderboard with 28 picks across 47 runs (59.6% win rate, Wilson lower bound 0.45). #2 is GPT, #3 is Perplexity.

Every Polymind run ends with a judge model picking the panelist(s) it leaned on. We tally those picks across every run, ever. Rankings use the Wilson 95% lower bound — sample size matters, not just raw rate.

Ranked by Wilson lower bound

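| # | Model      | Picks | Runs | Win rate |
|---|------------|-------|------|----------|
| 1 | Claude     | 28    | 47   | 59.6%    |
| 2 | GPT        | 12    | 46   | 26.1%    |
| 3 | Perplexity | 8     | 34   | 23.5%    |
| 4 | Gemini     | 9     | 43   | 20.9%    |
| 5 | Grok       | 7     | 35   | 20.0%    |
| 6 | Mistral    | 5     | 43   | 11.6%    |
| 7 | Qwen       | 3     | 25   | 12.0%    |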

50 runs · across every domain · refreshed every 5 minutes · methodology v2

Common questions

Which AI model wins most often on Polymind?

The table on this page ranks every panelist by Wilson 95% lower bound, which is the live answer to "who has the strongest track record across every domain." We rank by the lower bound rather than the raw win rate so that a 1/1 model can't outrank an 80/100 model just by being new. The table refreshes every 5 minutes.

How is the ranking measured?

Every Polymind run ends with a judge model nominating the 1–3 panelists it leaned on most. We count those as picks. The ranking column, win_rate_lower = wilson_lower_bound(picks, appearances, z=1.96), applies the same correction that Reddit's "best" comment sort uses. Picks and appearances are aggregated across every domain, over the 50 runs contributing as of the last refresh.
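For readers who want to reproduce the ranking column, here is a minimal Python sketch of the standard Wilson score lower bound, applied to the picks and runs from the table on this page. The function name mirrors wilson_lower_bound from the text, but the body is an illustration and not Polymind's actual code; returning 0.0 for a model with no appearances is an assumption.

```python
import math


def wilson_lower_bound(picks: int, appearances: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for picks / appearances."""
    if appearances == 0:
        return 0.0  # assumption: a model with no appearances sorts last
    p = picks / appearances
    z2 = z * z
    centre = p + z2 / (2 * appearances)
    margin = z * math.sqrt(p * (1 - p) / appearances + z2 / (4 * appearances ** 2))
    return (centre - margin) / (1 + z2 / appearances)


# (picks, appearances) copied from the table on this page
rows = {
    "Claude": (28, 47),
    "GPT": (12, 46),
    "Perplexity": (8, 34),
    "Gemini": (9, 43),
    "Grok": (7, 35),
    "Mistral": (5, 43),
    "Qwen": (3, 25),
}

# Sort by the lower bound, highest first, as the leaderboard does
ranked = sorted(rows, key=lambda m: wilson_lower_bound(*rows[m]), reverse=True)
```

Sorting this way puts Mistral (5/43) ahead of Qwen (3/25) even though Qwen's raw rate is slightly higher, because Qwen's smaller sample gives it a wider interval.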

What counts as a 'pick'?

Picks come from the judge, not from users. Every Polymind run ends with a judge model synthesising the panelists' final answers and appending a private marker naming the one to three panelists it leaned on most. Each named panelist gets one pick for that run. The judge does not get a pick for its own synthesis.
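The tallying rule can be sketched in a few lines of Python. The run record shape used here (a dict with "panelists" and "picked" keys) is hypothetical, since the private marker format is not published. The sketch assumes the marker has already been parsed into a list of names, and it ignores any name that is not on that run's panel.

```python
from collections import Counter


def tally(runs):
    """Count picks and appearances per panelist across runs.

    Each run is a dict with hypothetical keys:
      "panelists": names of the models on the panel for that run
      "picked":    names the judge leaned on (one to three of them)
    """
    picks, appearances = Counter(), Counter()
    for run in runs:
        panel = set(run["panelists"])
        appearances.update(panel)
        # One pick per named panelist per run; names outside the panel
        # (for example the judge itself) are ignored by assumption.
        picks.update(set(run["picked"]) & panel)
    return picks, appearances


runs = [
    {"panelists": ["A", "B", "C"], "picked": ["A", "B"]},
    {"panelists": ["A", "C"], "picked": ["A"]},
]
picks, appearances = tally(runs)
```

Here panelist A ends up with 2 picks from 2 appearances, and C with 0 picks from 2 appearances.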

Why rank by the Wilson lower bound instead of raw win rate?

Raw picks/appearances is unstable when sample sizes are uneven. The Wilson 95% lower bound asks: 'given the picks I've seen, what's the lowest the true win rate could plausibly be?' That lets us compare a 4/5 model and an 80/100 model honestly: both have an 80% raw rate, but the 4/5 model has a much wider confidence interval, so its lower bound is lower. Reddit uses the same correction for its 'best' comment sort.
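To make the 4/5 versus 80/100 comparison concrete, the worked example below uses the standard Wilson score formula with z = 1.96. The symbols are introduced here for illustration: \hat{p} is the raw rate picks/appearances and n is the number of appearances.

```latex
\[
  \text{lower}(\hat{p}, n) =
  \frac{\hat{p} + \dfrac{z^2}{2n}
        - z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}
       {1 + \dfrac{z^2}{n}}
\]

% Both models have \hat{p} = 0.8; only n differs.
\[
  \text{lower}(0.8,\, 5) =
  \frac{0.8 + 0.3842 - 1.96\sqrt{0.0320 + 0.0384}}{1 + 0.7683}
  \approx \frac{0.6641}{1.7683} \approx 0.38
\]
\[
  \text{lower}(0.8,\, 100) =
  \frac{0.8 + 0.0192 - 1.96\sqrt{0.0016 + 0.0001}}{1 + 0.0384}
  \approx \frac{0.7385}{1.0384} \approx 0.71
\]
```

The same 80% raw rate is worth roughly 0.38 on five appearances and roughly 0.71 on a hundred, so the larger sample ranks higher.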

Can I trust a row with very few appearances?

Any row with fewer than 10 appearances renders faded with a 'needs more data' hint. The ranking position is still derived from the Wilson lower bound — which is precisely what corrects for thin samples — but visually we signal that the order is provisional until that row crosses the threshold.
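The display rule can be sketched as a flag that is added to each row without touching the sort order. The threshold of 10 comes from the text; the row shape and the "provisional" field name are hypothetical.

```python
MIN_APPEARANCES = 10  # threshold stated in the text


def annotate(rows, threshold=MIN_APPEARANCES):
    """Return rows in their existing order with a 'provisional' flag added.

    'provisional' is a hypothetical field name; it is True when the row
    has fewer appearances than the threshold and should render faded.
    """
    return [
        {**row, "provisional": row["appearances"] < threshold}
        for row in rows
    ]


rows = [
    {"model": "A", "appearances": 47},
    {"model": "B", "appearances": 9},
    {"model": "C", "appearances": 10},
]
flagged = annotate(rows)
```

Only B is flagged here: a row with exactly 10 appearances has reached the threshold and renders normally.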

How current are these numbers?

The aggregation has a 5-minute TTL cache and the page revalidates on the same clock. A newly-completed run appears in the totals within ~5 minutes.
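A 5-minute TTL cache of this kind could look roughly like the Python sketch below. The class name and its parameters are assumptions made for illustration, not Polymind's implementation; the clock is injectable so that expiry can be checked without waiting.

```python
import time


class TTLCache:
    """Cache one computed value and recompute it once the TTL has elapsed."""

    def __init__(self, compute, ttl_seconds=300.0, clock=time.monotonic):
        self._compute = compute      # function that rebuilds the aggregate
        self._ttl = ttl_seconds      # 300 s = the 5-minute TTL from the text
        self._clock = clock          # injectable so tests can control time
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # First call, or the cached value is stale: recompute it.
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value
```

With this shape, a run that completes just after a refresh waits at most one TTL before it shows up, which matches the "within ~5 minutes" figure above.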

Where can I download the raw data?

Every leaderboard slice has a machine-readable JSON twin at /leaderboard/{domain}/data.json — same numbers as the HTML page, CC-BY-4.0, CORS-open, 5-minute cache. The all-domain feed for this page is at https://polymind.cloud/leaderboard/all/data.json.
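A minimal Python sketch for pulling the feed, using only the standard library. The URL pattern comes from the text; the helper names data_url and fetch_leaderboard are hypothetical. The JSON schema is not documented here, so the sketch returns the parsed payload as-is and assumes nothing about its fields.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://polymind.cloud/leaderboard"


def data_url(domain: str = "all") -> str:
    """Build the data.json URL for one leaderboard slice."""
    return f"{BASE_URL}/{quote(domain, safe='')}/data.json"


def fetch_leaderboard(domain: str = "all", timeout: float = 10.0):
    """Download and parse one slice; the payload shape is not assumed."""
    with urlopen(data_url(domain), timeout=timeout) as resp:
        return json.load(resp)
```

Because the feed sits behind a 5-minute cache, polling it more often than that is unlikely to return new numbers.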