How to honestly compare LLMs: a methodology

Polymind · 2026-05-21 · 6 min read
Tags: methodology · leaderboard · multi-model

If you have ever tried to decide which AI is "best" for the work you actually do, you have probably noticed that nothing you read online is quite the answer to that question. Benchmarks publish saturated scores. Leaderboards rank yesterday's checkpoint. Reviews are written by someone whose use case isn't yours. The honest version of the question is harder, and the honest version of the answer is a methodology, not a number.

This post sketches what that methodology looks like in the abstract, then shows how Polymind implements it. The longer engineering write-up lives on the methodology page; this is the general-audience tour.

Why the easy answers don't work

There are three common ways people compare LLMs, and all three quietly fail.

Public benchmarks. MMLU, HumanEval, GSM8K, GPQA — pick one. Once a benchmark is published, it ends up in training data, intentionally or not, and the score stops being a measurement of capability and starts being a measurement of memorization. Saturation follows, then a new benchmark, and the cycle repeats. A 92% on MMLU in 2026 doesn't mean what 92% on MMLU meant in 2023.

Single-prompt vibes checks. "I asked it my favorite question and it nailed it" is anecdote, not comparison. Anyone who's spent ten minutes generating two completions from the same model with the same prompt knows how much two samples can differ; one sample from each of two models is barely signal at all.

Vendor leaderboards. Each frontier lab publishes its own comparison chart. Those charts are correct in the narrow technical sense and uninformative in every other sense — every lab finds the slice of evaluations where its current model wins.

None of these are scams. They're just trying to answer a different question than "which model should I actually use for the work I do."

What an honest comparison looks like

A defensible head-to-head between LLMs has at least these properties:

Same prompt to every model. The unit of comparison is the question, not the model. If you can't show "model A and model B both got asked exactly this, here's what each said," you're comparing noise.
Parallel, not sequential. Running model A today and model B tomorrow lets infrastructure drift, prompt drift, and your own mood into the comparison. Hit them at the same instant or skip the exercise.
A judge that wasn't told in advance who should win. A reviewer who knows Claude is supposed to win will find ways to make Claude win. The most honest version uses a third party, often another LLM, that sees only the candidate answers, never the identities of the models that produced them.
Many questions, not one. The variance from question to question swamps the variance between top frontier models. If you have fewer than a hundred runs, you have an anecdote.
A ranking that respects sample size. A model that's been asked one question and got it right is at 100% accuracy. So is a model that's been asked a thousand and got all thousand right. These are not the same situation. Honest rankings care about the difference.
Admitted biases. Every comparison has them. The honest move is to publish the ones you know about, not to pretend they aren't there.

These six are an editorial choice, not a law of nature, but anything calling itself a "comparison" that fails three or more of them is mostly entertainment.

How Polymind tries to do this

Polymind is built around this checklist almost by accident — it started as a multi-model debate tool, and the comparison data fell out of the design.

Same prompt, in parallel. When you ask Polymind a question, that exact prompt fans out to every panelist model at once. Same wording, same instant, same temperature. None of them know the others exist until the optional critique round, at which point they all see each other's answers simultaneously.
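
A minimal sketch of that fan-out, assuming each panelist can be wrapped as an async function that takes a prompt and a temperature. The `Panelist` type, `fan_out`, and the stub models are hypothetical names for illustration, not Polymind's actual code.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical panelist type: takes (prompt, temperature), returns answer text.
Panelist = Callable[[str, float], Awaitable[str]]


async def fan_out(
    prompt: str, panel: Dict[str, Panelist], temperature: float = 0.7
) -> Dict[str, str]:
    """Send the identical prompt and temperature to every panelist concurrently."""
    names = list(panel)
    # gather() schedules every call concurrently and returns results in
    # input order, so no panelist waits on (or sees) another's answer.
    answers = await asyncio.gather(
        *(panel[name](prompt, temperature) for name in names)
    )
    return dict(zip(names, answers))


# Stub panelists standing in for real API clients (assumption).
async def stub_a(prompt: str, temperature: float) -> str:
    await asyncio.sleep(0.01)
    return f"A says: {prompt}"


async def stub_b(prompt: str, temperature: float) -> str:
    await asyncio.sleep(0.01)
    return f"B says: {prompt}"


if __name__ == "__main__":
    panel = {"model_a": stub_a, "model_b": stub_b}
    print(asyncio.run(fan_out("What is 2 + 2?", panel)))
```

The point of the design is that the prompt string and the temperature are passed once and shared, so there is no per-model wording to drift.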

Post-hoc judging. The last step of every run is a judge model synthesizing the panelists' final answers into a single response. As part of writing that synthesis, the judge names which panelists it leaned on most — we call those picks. The judge doesn't know in advance which models are even on the panel; it just reads the answers and points at the ones that helped.
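
One way to keep a judge blind to identities can be sketched as follows. This is an illustration under assumptions, not a description of Polymind's internals: the function names and the "Answer N" labels are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def anonymize(
    answers: Dict[str, str], rng: random.Random
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (label -> answer text, label -> model name) with shuffled order."""
    names = list(answers)
    rng.shuffle(names)  # random order, so position does not leak identity
    labels = [f"Answer {i + 1}" for i in range(len(names))]
    blinded = {label: answers[name] for label, name in zip(labels, names)}
    key = dict(zip(labels, names))  # kept away from the judge
    return blinded, key


def resolve_picks(picked_labels: List[str], key: Dict[str, str]) -> List[str]:
    """Map the judge's picked labels back to model names after judging."""
    return [key[label] for label in picked_labels]
```

The judge would be shown only `blinded`; the `key` mapping is applied afterwards to turn its picks into per-model counts.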

Many questions. Every public Polymind run feeds the leaderboard. We're not designing the question set — the users are, by asking whatever they actually need to ask. That makes the data noisier than a curated benchmark and more representative than one.

Sample-size-aware ranking. This is the part most leaderboards get wrong, so it's worth lingering on. Polymind sorts panelists by the Wilson 95% lower bound of their pick rate, not the raw percentage. Translated: instead of asking "what fraction of the time did the judge pick this model?" we ask "given how few samples we have, what's the most pessimistic plausible value for that fraction?"
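
For readers who want the math, the textbook form of the Wilson lower bound is below. Here k is the number of picks, n the number of panel appearances, p̂ = k/n the raw pick rate, and z ≈ 1.96 for a 95% interval. This is the standard formula; whether the leaderboard applies any further adjustment is a question for the methodology page.

```latex
\[
\text{lower}(k, n) =
\frac{\hat{p} + \dfrac{z^{2}}{2n}
      - z\sqrt{\dfrac{\hat{p}\,(1-\hat{p})}{n} + \dfrac{z^{2}}{4n^{2}}}}
     {1 + \dfrac{z^{2}}{n}},
\qquad \hat{p} = \frac{k}{n}
\]
```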

The practical effect is dramatic. A model picked 1 out of 1 times has a raw pick rate of 100% and a Wilson lower bound of about 21%. A model picked 800 out of 1000 times has a raw pick rate of 80% and a Wilson lower bound of about 77%. The lower bound rewards being picked often across many runs, not a perfect record on a handful of them. It's the same correction Reddit's "best" comment sort uses, and it is in the same spirit as the confidence intervals LMSYS Arena reports on its ratings. The full formula and a worked example are on the methodology page if you want to verify it.
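
The two figures above can be checked with a few lines of Python. This is a sketch of the standard Wilson lower bound with z = 1.96; the function name is ours, and the zero-appearance fallback is an assumption rather than something the leaderboard documents here.

```python
import math


def wilson_lower_bound(picks: int, appearances: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for picks / appearances."""
    if appearances == 0:
        return 0.0  # assumption: no data ranks at the bottom
    p = picks / appearances
    z2 = z * z
    center = p + z2 / (2 * appearances)
    margin = z * math.sqrt(p * (1 - p) / appearances + z2 / (4 * appearances**2))
    return (center - margin) / (1 + z2 / appearances)


print(round(wilson_lower_bound(1, 1), 2))       # 0.21
print(round(wilson_lower_bound(800, 1000), 2))  # 0.77
```

Sorting by this value rather than by `picks / appearances` is what keeps a one-run wonder from sitting above a model with a long track record.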

The biases we publish out loud

The leaderboard has three biases worth stating plainly.

The judge is itself an LLM with preferences. Some judges consistently like longer answers; some like more hedged ones. We don't correct for this — partly because there's no obvious correct correction, and partly because "which models do real judges pick" is a more honest description of what we're measuring than "which models are objectively best."

The panel composition is user-chosen. People put their favorites on, which means a Claude fan and a GPT fan ask different questions and build different panels. We normalize by appearances, so a model that sits on ten times as many panels isn't ten times more likely to top the chart, but we can't normalize for the shape of the questions a given user tends to ask.
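
Normalizing by appearances amounts to dividing each model's picks by the number of runs it actually sat on. A rough sketch, assuming each run is recorded as the list of models on the panel plus the list the judge picked; the `Run` shape and `pick_rates` are hypothetical, not the leaderboard's real schema.

```python
from collections import Counter
from typing import Dict, Iterable, List, TypedDict


class Run(TypedDict):
    panel: List[str]  # models that answered in this run
    picks: List[str]  # subset of the panel the judge leaned on


def pick_rates(runs: Iterable[Run]) -> Dict[str, float]:
    """Raw pick rate per model: picks divided by panel appearances."""
    appearances: Counter = Counter()
    picks: Counter = Counter()
    for run in runs:
        appearances.update(run["panel"])
        picks.update(run["picks"])
    # Dividing by appearances means being on more panels does not,
    # by itself, raise a model's rate.
    return {model: picks[model] / n for model, n in appearances.items()}
```

With this shape, a model picked 2 times in 3 appearances scores about 0.67 whether or not another model appeared on far more panels.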

The domain tags come from a cheap classifier, not a human. A question about "regex syntax for legal citations" could plausibly land in either Code or Legal, and we don't post-hoc correct. Misfiled runs add noise to whichever per-domain tab they landed in but don't move the all-domain numbers.

You can read these on the methodology page in their full form, and they're surfaced again at the bottom of the leaderboard itself so a citing reader can't miss them.

What to do with this

If you take one thing from this post, take this: a comparison that doesn't admit its biases isn't more honest, it's less. A leaderboard that hides its sample sizes isn't easier to read, it's wrong more often.

When you read the Polymind leaderboard, notice the badge under the table that says "methodology v2." That number exists because the math has versions, and we'd rather a citation be pinnable than seem timeless. When you read someone else's leaderboard, ask the same six questions: same prompt? parallel? post-hoc judge? many questions? sample-size-aware ranking? admitted biases?

Three yesses out of six is a vibes chart. Six out of six is a methodology.

The next post in this series digs into the data itself — six months of judge picks, where the rank order is stable, and where it's moving. Subscribe via the RSS feed if you want to catch it.