Methodology · v2
How the leaderboard is computed
We publish this in detail because the leaderboard is meant to be cited. The math is straightforward; the trickier parts are what counts as a “run” and what doesn't. This page documents both.
TL;DR. Every Polymind run ends with a judge model naming the one to three panelists it leaned on most (“picks”). We rank panelists by the Wilson 95% lower bound of picks / appearances (the same correction LMSYS Arena and Reddit's “best” sort use), so a 1/1 model can't outrank an 80/100 model just by being new. Domain tags (code, legal, medical, creative, research, general) come from a cheap submission-time classifier. Aggregation refreshes every 5 minutes. Current methodology version: v2.
What we measure
Every Polymind run ends with a judge model synthesizing the panelists' final answers. As part of its response, the judge appends a private marker naming the one to three panelists it leaned on most. We record those as picks. We also record every appearance — how many runs each model was on the panel for.
For a given (window, domain) slice, we compute:
win_rate = picks / appearances
win_rate_lower = wilson_lower_bound(picks, appearances, z=1.96)
Rankings are sorted by win_rate_lower, not win_rate.
That's the same correction LMSYS Arena uses and Reddit's “best” comment sort uses: at 95% confidence, what's the lowest the true win-rate could be given the sample size? A model with 1 win in 1 appearance has a Wilson lower bound near 0.21, not 1.0, while a model at 80/100 has a lower bound near 0.71. The newcomer can't outrank the established model just by being new.
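As a minimal sketch of the calculation above, here is the standard Wilson score lower bound in Python. The function name matches the pseudocode on this page, but the body is an illustration rather than Polymind's production code, and returning 0.0 for a model with no appearances is an assumption.

```python
import math


def wilson_lower_bound(picks: int, appearances: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for picks / appearances."""
    if appearances <= 0:
        return 0.0  # assumption: no appearances means no evidence, so score 0
    if not 0 <= picks <= appearances:
        raise ValueError("picks must be between 0 and appearances")
    n = appearances
    p = picks / n                      # raw win_rate
    z2 = z * z
    centre = p + z2 / (2 * n)          # centre shifted toward 0.5 for small n
    margin = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


if __name__ == "__main__":
    print(round(wilson_lower_bound(1, 1), 3))     # 0.207
    print(round(wilson_lower_bound(80, 100), 3))  # 0.711
```

The 1/1 model has a raw win rate of 1.0 but a lower bound of about 0.207, so it sorts below the 80/100 model at about 0.711.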
Domain classification
At submission time, Polymind sends the user's prompt to a cheap classifier model (currently the smallest GPT, capped at 8 output tokens). The classifier returns one of:
- code — programming, debugging, system design, dev tooling.
- legal — law, contracts, regulations, compliance.
- medical — health, treatment, fitness, biology of the human body.
- creative — writing, art, design, naming, ideation.
- research — academic / scientific / citation-heavy investigation.
- general — fallback for anything else, including questions the classifier was unsure about.
The classifier runs in parallel with the debate, so it adds no user-visible latency, and a 3-second timeout bounds the call. On timeout, error, or any off-allowlist response we tag the run as general rather than blocking persistence. Tags are cached by prompt hash, so repeated questions don't pay the classifier cost twice.
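The fallback behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the production code: `call_classifier` is a hypothetical async client for the classifier model, the cache is a plain in-memory dict, SHA-256 stands in for whatever prompt hash is actually used, and caching only allowlisted results (not fallbacks) is also an assumption.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, Dict

ALLOWED_TAGS = {"code", "legal", "medical", "creative", "research", "general"}
_tag_cache: Dict[str, str] = {}  # hypothetical cache: prompt hash -> tag


async def classify_domain(
    prompt: str,
    call_classifier: Callable[[str], Awaitable[str]],  # hypothetical model client
    timeout_s: float = 3.0,
) -> str:
    """Return a domain tag, falling back to "general" on any failure."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _tag_cache:
        return _tag_cache[key]  # repeated prompt: skip the classifier call
    try:
        raw = await asyncio.wait_for(call_classifier(prompt), timeout=timeout_s)
        tag = raw.strip().lower()
    except Exception:
        return "general"  # timeout or classifier error
    if tag not in ALLOWED_TAGS:
        return "general"  # off-allowlist response
    _tag_cache[key] = tag  # assumption: only clean results are cached
    return tag
```

Because the coroutine never raises on classifier failure, the caller can await it alongside the debate and persist the run with whatever tag comes back.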
Known limitation: a single classifier call on the prompt alone can mis-tag edge cases (a question about “regex syntax for legal citations” could land in either Code or Legal). We don't correct these post-hoc. The Wilson sort and the sample-size floor protect ranking quality; mis-tagged runs mostly add noise to the wrong tab.
What we publish (and what we don't)
Every completed run on Polymind contributes to the leaderboard. Only three things from each run leave your account: which panelist(s) the judge picked, which models appeared on the panel, and the domain tag. The prompt itself and the model answers stay private to your account — they're never published, never aggregated, never shown alongside the rankings.
Want a specific run off the public board? Delete it from the history sheet. Deletion removes the run from your account and from every aggregation it was counted in; the next 5-minute refresh re-tallies without it.
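To make the data flow concrete, here is a rough sketch of how picks and appearances could be tallied from the three published fields, and how a deletion falls out of a re-tally. The `RunRecord` type, its field names, and the `tally` function are hypothetical; the real storage schema is not documented here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record holding only what leaves the account."""
    run_id: str
    domain: str              # classifier tag, e.g. "code"
    panel: Tuple[str, ...]   # models that appeared on the panel
    picks: Tuple[str, ...]   # one to three panelists the judge picked


def tally(runs: Iterable[RunRecord]) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Return {(domain, model): (picks, appearances)}."""
    picks: Counter = Counter()
    appearances: Counter = Counter()
    for run in runs:
        panel = set(run.panel)
        for model in panel:
            appearances[(run.domain, model)] += 1
        for model in set(run.picks) & panel:  # ignore picks not on the panel
            picks[(run.domain, model)] += 1
    return {key: (picks[key], n) for key, n in appearances.items()}


runs = [
    RunRecord("r1", "code", ("model-a", "model-b", "model-c"), ("model-a",)),
    RunRecord("r2", "code", ("model-a", "model-b"), ("model-a", "model-b")),
]
before = tally(runs)
# Deleting r2 is modelled as re-tallying without it.
after = tally(r for r in runs if r.run_id != "r2")
```

In this sketch a deleted run leaves no residue, because the totals are always recomputed from the runs that still exist rather than decremented in place.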
Refresh cadence
The aggregation query has a 5-minute TTL cache. A newly completed run will appear in the totals within ~5 minutes, not instantly, and the CDN-fronted page itself revalidates on the same clock. That's the same trade-off LMSYS makes: live numbers are cheaper to claim than to serve, and a 5-minute window doesn't change anyone's interpretation of the rank order.
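A TTL cache of this kind can be sketched in a few lines. The `TTLCache` class below is a hypothetical illustration of the idea (recompute at most once per 300 seconds), not the caching layer Polymind actually runs; the injectable `clock` exists only so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TTLCache(Generic[T]):
    """Cache one computed value and recompute it after ttl_s seconds."""

    def __init__(
        self,
        compute: Callable[[], T],
        ttl_s: float = 300.0,  # 5 minutes
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._ttl_s = ttl_s
        self._clock = clock
        self._value: Optional[T] = None
        self._expires_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._compute()       # run the expensive aggregation
            self._expires_at = now + self._ttl_s
        return self._value  # type: ignore[return-value]
```

With this shape, a run that completes just after a refresh waits at most one full TTL before it shows up in the totals.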
Known biases
- Judge-pick bias. The judge is itself a model. Different judges may consistently favor different panelist styles (longer, more confident, more hedged). We don't correct for this: the leaderboard measures “which models do judges pick,” which is what a downstream user actually cares about, not “which models are objectively best.”
- User-base selection bias. Polymind's users aren't a representative sample of all LLM users — the domain mix in the public set skews toward the kinds of questions our users actually ask. That's fine for ranking models against each other on those questions, but should be kept in mind when citing absolute totals.
- Free vs. premium models. The free tier pins each panelist to the cheaper model per provider. Premium runs use whichever models the user picks. A model that's only available to premium users will appear less often than one shipped to both tiers.
- Small-sample noise. Rows with fewer than 10 appearances fade in the table; their order is provisional. The Wilson lower bound mitigates this for the ranking math, but visual treatment helps readers not over-interpret early data.
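The small-sample treatment can be illustrated with a short sketch that sorts by the Wilson lower bound and flags rows under 10 appearances as provisional. The `Row` type, the `rank` function, and the model names are hypothetical; only the threshold of 10 and z = 1.96 come from this page.

```python
import math
from typing import Dict, List, NamedTuple, Tuple

PROVISIONAL_BELOW = 10  # rows with fewer appearances fade in the table


class Row(NamedTuple):
    model: str
    picks: int
    appearances: int
    win_rate_lower: float
    provisional: bool


def _wilson_lower(picks: int, n: int, z: float = 1.96) -> float:
    """Standard Wilson score lower bound; 0.0 when n is 0 (assumption)."""
    if n <= 0:
        return 0.0
    p = picks / n
    z2 = z * z
    margin = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return (p + z2 / (2 * n) - margin) / (1 + z2 / n)


def rank(counts: Dict[str, Tuple[int, int]]) -> List[Row]:
    """Sort models by Wilson lower bound, best first, flagging thin rows."""
    rows = [
        Row(model, k, n, _wilson_lower(k, n), n < PROVISIONAL_BELOW)
        for model, (k, n) in counts.items()
    ]
    return sorted(rows, key=lambda r: r.win_rate_lower, reverse=True)


ranked = rank({"newcomer": (1, 1), "veteran": (80, 100), "mid": (5, 9)})
for row in ranked:
    print(row.model, round(row.win_rate_lower, 3), row.provisional)
```

Here the veteran ranks first, and both thin rows are flagged as provisional even though they still receive a position in the order.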
Citing the leaderboard
Stable URLs:
- polymind.cloud/leaderboard/all (unified view).
- Domain-segmented views: /general, /code, /creative, /research, /legal, /medical (General, Code, Creative, Research, Legal, Medical).
- polymind.cloud/leaderboard/methodology (this page).
Machine-readable JSON twin of every leaderboard slice, suitable for notebooks, dashboards, or AI crawlers — same data as the HTML page, 5-minute cache, CC-BY-4.0, CORS-open:
- polymind.cloud/leaderboard/all/data.json
- Per-domain JSON: /general/data.json, /code/data.json, /creative/data.json, /research/data.json, /legal/data.json, /medical/data.json.
- Flat CSV snapshot of every slice (header + one row per domain × provider): polymind.cloud/data/leaderboard-latest.csv.
When citing, please include the methodology version (currently v2) so a reader can pin which math you were looking at. The version increments when the aggregation logic, what counts as a contributed run, or the Wilson z-score changes.