Wilson lower bound vs raw win rate: why LMSYS and Polymind sort this way

Polymind · 7 min read · methodology · leaderboard · statistics

Almost every leaderboard you have seen sorted by "win rate" or "accuracy" or "% preferred" is misranking the top of the list. Not by a little, and not because of a bug. It is misranking because raw rates treat a model picked once out of one as identical to a model picked nine hundred and ninety-nine times out of a thousand. Both are 100% in raw terms; one of them is a tossed coin and the other is a wall clock.

The standard fix has a name — the Wilson score interval lower bound — and it has been the right answer since 1927. Reddit adopted it for "best" comment sorting in 2009. LMSYS Chatbot Arena publishes Elo confidence intervals in the same spirit. The Polymind leaderboard sorts on it. This post is a tour of why, with the math, a worked example, and the places the trick stops working.

The problem, stated once

Imagine two new restaurants in town. Both have a single five-star review on Yelp. By raw average rating, they tie. You would not infer from this that they are equally good restaurants. The honest intuition is: I don't know yet, and I won't know until more people visit.

Now imagine a third restaurant with 800 five-star reviews and 200 one-stars. Raw average: 4.2 stars. Strictly lower than the five-star restaurants. By any sensible ranking, this third place is the most defensible recommendation of the three, and yet a raw-rate sort buries it below two unknowns.

This is exactly what raw win-rate sorting does to LLM leaderboards. A model that has appeared once and got picked by the judge sits at 100%. A model that has appeared a thousand times and got picked 770 times sits at 77%. Sort descending and the new arrival outranks the established workhorse, until enough samples accumulate for the gap to flip. In the meantime, the leaderboard is wrong, the screenshots get posted, and the wrong model gets recommended.

What the Wilson lower bound actually computes

Given n trials and k successes (for Polymind, the number of debates a model appeared in and the number of times the judge picked its answer), the raw rate is p̂ = k / n. The Wilson 95% lower bound is the bottom of the 95% Wilson score interval: the smallest value the true underlying success rate could plausibly take given the k and n observed.

The closed-form expression looks like this:

$$
\text{lower} = \frac{1}{1 + z^2/n}\left(\hat{p} + \frac{z^2}{2n} - z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}\right)
$$

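with z ≈ 1.96 for a 95% interval. Inside the parentheses, the first term is the raw rate. The second, together with the leading factor, is a small "pull toward 0.5" correction that fades as n grows. The third is the uncertainty margin: wider when n is small and, for a fixed n, widest when p̂ is near 0.5 and narrower toward the extremes. Because the centre is pulled toward 0.5, the resulting interval is asymmetric around p̂.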
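For readers who prefer code to notation, here is a minimal Python sketch of the closed-form expression above. The function name `wilson_lower_bound`, the choice to return 0.0 when n is zero, and the clamp at zero are assumptions made for illustration, not a description of Polymind's implementation.

```python
import math

def wilson_lower_bound(k: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for k successes in n trials."""
    if n <= 0:
        return 0.0  # assumption: an entry with no data ranks at the bottom
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    p_hat = k / n
    z2 = z * z
    centre = p_hat + z2 / (2 * n)                      # raw rate plus the pull toward 0.5
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    # Clamp at zero to absorb floating-point noise when k == 0.
    return max(0.0, (centre - margin) / (1 + z2 / n))
```

The default `z=1.96` matches the 95% interval used in this post; passing a different z gives a different confidence level.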

You don't have to remember the formula. You only have to remember its shape: as n grows, the lower bound climbs toward the raw rate; as n shrinks, it collapses toward zero. The leaderboard is sorted by the bottom of the bar, not the middle, so models with few samples can't camp at the top.
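One way to see that shape is to hold the raw rate fixed and vary only the sample size. The sketch below does that for a raw rate of 80%; the sample sizes are arbitrary illustrative values, and the helper is a compact restatement of the formula that assumes n > 0.

```python
import math

def wilson_lower_bound(k, n, z=1.96):
    """Wilson score lower bound for k successes in n trials (assumes n > 0)."""
    p_hat, z2 = k / n, z * z
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (p_hat + z2 / (2 * n) - margin) / (1 + z2 / n)

# Same 80% raw rate at every row; only n changes.
rows = [(n, wilson_lower_bound(4 * n // 5, n)) for n in (5, 50, 500, 5000)]
for n, lower in rows:
    print(f"n={n:>5}  raw=0.800  lower={lower:.3f}")
```

Each row's lower bound sits closer to 0.8 than the one before it, which is the "bottom of the bar climbs as evidence accumulates" behaviour described above.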

The 1-of-1 vs 800-of-1000 worked example

Plug numbers in.

A model picked 1 out of 1 times: raw rate 100%. Wilson lower bound at 95% confidence: about 20.7%. The interval is saying "yes, the point estimate is one — but with only one sample, the true rate could plausibly be as low as one in five."
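The 1-of-1 case is easy to check by hand, because p̂ = 1 makes the p̂(1 − p̂) term vanish. Substituting n = 1 and z = 1.96 into the formula:

```latex
\begin{aligned}
\hat{p} &= \tfrac{1}{1} = 1, \qquad \hat{p}(1-\hat{p}) = 0 \\
\text{lower} &= \frac{1}{1 + z^2}\left(1 + \frac{z^2}{2} - z\sqrt{0 + \frac{z^2}{4}}\right)
  = \frac{1}{1 + z^2}\left(1 + \frac{z^2}{2} - \frac{z^2}{2}\right)
  = \frac{1}{1 + z^2} \\
&= \frac{1}{1 + 3.8416} \approx 0.2065
\end{aligned}
```

The same cancellation holds for any perfect record: when k = n, the lower bound reduces to n / (n + z²).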

A model picked 800 out of 1000 times: raw rate 80%. Wilson lower bound: about 77.4%. The interval has tightened enormously because the sample is large. Relative to the 1-of-1 model, the point estimate is 20 points lower, but the lower bound is almost 57 points higher.

Sort descending by lower bound: the 800-of-1000 model is on top by about fifty-seven points. Sort by raw rate and the 1-of-1 model "wins" by twenty. The first ordering is the one a serious reader actually wants; the second is what an unmoderated leaderboard ships.
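The two orderings can be reproduced in a few lines. This is a sketch with made-up model names (`newcomer`, `workhorse`); only the counts come from the example above.

```python
import math

def wilson_lower_bound(k, n, z=1.96):
    """Wilson score lower bound for k successes in n trials (assumes n > 0)."""
    p_hat, z2 = k / n, z * z
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (p_hat + z2 / (2 * n) - margin) / (1 + z2 / n)

# Hypothetical names; (k, n) pairs taken from the worked example.
models = {"newcomer": (1, 1), "workhorse": (800, 1000)}

by_raw = sorted(models, key=lambda m: models[m][0] / models[m][1], reverse=True)
by_wilson = sorted(models, key=lambda m: wilson_lower_bound(*models[m]), reverse=True)

print(by_raw)     # ['newcomer', 'workhorse']
print(by_wilson)  # ['workhorse', 'newcomer']
```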

This is also exactly the reason Reddit's "best" sort doesn't put a new comment with one upvote at the top of every thread. Evan Miller's 2009 piece, How Not to Sort By Average Rating, walked through the same arithmetic against star reviews; Reddit adopted it within months. It has been the load-bearing default for honest rate-based rankings ever since.

Where this shows up in LLM evaluation

LMSYS Chatbot Arena doesn't use the Wilson interval directly — it fits a Bradley–Terry / Elo model — but the spirit is identical: a new arrival on the leaderboard gets a wide confidence interval, and the rank takes that interval into account, not just the point estimate. The Arena UI displays the interval as a ± next to the score; the score itself is shrunk toward the prior until enough pairwise comparisons exist to pin it down.

Polymind takes the same idea and applies it to judge picks. Every public debate produces a synthesis from a judge model, and as part of that synthesis the judge names which panelists it leaned on. Over thousands of debates, a model has been picked k times across n appearances. The rank order on the leaderboard is the descending Wilson lower bound of k / n, not the descending raw rate.

This is the correction that lets the leaderboard show a stable top of the list even when the panel composition is constantly shifting under it. A new model that lands one early pick doesn't shoot to number one. An established model with hundreds of appearances doesn't get dethroned by a rookie's lucky run. The per-domain views — code, legal, creative — sort the same way, which means a model that wins on the home page but hasn't been asked many code questions correctly ranks lower under "Code" until the code samples accumulate.
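To make the per-domain effect concrete, here is a sketch that ranks the same tallies overall and within one domain. The model names, domains, and counts are invented for illustration, and the tally layout is an assumption rather than Polymind's actual schema.

```python
import math
from collections import defaultdict

def wilson_lower_bound(k, n, z=1.96):
    """Wilson score lower bound for k picks in n appearances (assumes n > 0)."""
    p_hat, z2 = k / n, z * z
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (p_hat + z2 / (2 * n) - margin) / (1 + z2 / n)

# Hypothetical (model, domain, picks, appearances) tallies.
tallies = [
    ("model-a", "code", 3, 3),
    ("model-a", "legal", 400, 500),
    ("model-b", "code", 150, 200),
    ("model-b", "legal", 200, 300),
]

def rank(counts):
    """Order model names by descending Wilson lower bound of their [k, n]."""
    return sorted(counts, key=lambda m: wilson_lower_bound(*counts[m]), reverse=True)

overall = defaultdict(lambda: [0, 0])
by_domain = defaultdict(lambda: defaultdict(lambda: [0, 0]))
for model, domain, k, n in tallies:
    # Add each tally to both the overall bucket and its domain bucket.
    for bucket in (overall[model], by_domain[domain][model]):
        bucket[0] += k
        bucket[1] += n

print("overall:", rank(overall))            # ['model-a', 'model-b']
print("code:   ", rank(by_domain["code"]))  # ['model-b', 'model-a']
```

In this toy data `model-a` leads overall but has only three code appearances, so its perfect 3-of-3 record is not enough to outrank `model-b` under the code view.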

Where the Wilson lower bound is wrong

It is a tool, not a truth, and it has at least four limits worth naming.

It assumes the trials are independent. Two judges in a row both picking Claude after Claude wrote a particularly persuasive answer is not quite the independent Bernoulli setup the formula assumes. In practice the correlation is mild for LLM-as-judge work because the panel composition rotates, but it's there.

It ignores the judge's bias. The formula tells you how confident you can be in the rate of picks, but not whether the picks themselves reflect quality. If the judge consistently prefers longer answers, the lower bound is rigorously computing how often the judge prefers this model's longer answers — which is a real number, just not the one a naïve reader thinks they're getting. The methodology page names this bias explicitly so a reader citing the number can't claim they didn't know.

It rewards consistency over peak. Two models with the same lower bound can have very different score distributions. A model that always lands in the 75–80% zone can outrank one that lands between 60% and 99% with the same long-run average. For most "which model should I default to" questions this is the right behaviour — defaults should be reliable, not heroic. For "which model has the highest ceiling," the lower bound is the wrong tool and you should look at raw rates with their intervals shown.

It can be gamed by sample-size manipulation. If you control which questions go to which models, you can keep a favored model's n small until its early-luck k produces a Wilson lower bound that looks defensible. Polymind avoids this by computing n from panel appearances — the user picks the panel, not the ranking function — but any system where the operator chooses both is one to look at twice.

A heuristic to take to other leaderboards

Whenever you see a leaderboard ordered by win rate, accuracy, preference rate, or any other rate-of-success metric, look for the sample sizes. Then ask three questions.

  1. Is the top of the list dominated by entries with order-of-magnitude fewer samples than the established ones? If yes, the leaderboard is sorted by raw rate and you should mentally re-sort.
  2. Are the confidence intervals shown anywhere? If no, the leaderboard is making a claim it can't defend.
  3. Does the leaderboard tell you, in prose, what statistic it sorts by? If no, assume raw rate and treat the rankings as anecdote.

A leaderboard that survives those three questions is doing the work. A leaderboard that doesn't is a screenshot.
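The first question lends itself to a quick mechanical check if a leaderboard exposes sample sizes. The sketch below flags top entries whose n is at least ten times smaller than the board's median; the function name, the top-3 window, and the use of the median as "typical" are all assumptions chosen for illustration.

```python
from statistics import median

def thin_top_entries(rows, top=3, factor=10):
    """Return names among the first `top` rows whose sample size is at least
    `factor` times smaller than the median sample size of the whole board.

    rows: list of (name, n) tuples in the leaderboard's displayed order.
    """
    typical = median(n for _, n in rows)
    return [name for name, n in rows[:top] if n * factor <= typical]

# Hypothetical board, listed in its published rank order.
board = [("x", 2), ("y", 900), ("z", 1000), ("w", 1200), ("v", 5)]
print(thin_top_entries(board))  # ['x']
```

A non-empty result does not prove the board is sorted by raw rate, but it is a reasonable prompt to go looking for the sort key.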

If you want to verify the Polymind implementation against the formula, the full math and the precise constants live on the methodology page, and the JSON twin lets you pull the raw k, n, win_rate, and win_rate_lower and re-derive the rank order yourself. That is the point of publishing the lower bound: the ranking is a function the reader can independently check, not an opaque sort key.
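A re-derivation along those lines might look like the sketch below. The field names `k`, `n`, `win_rate`, and `win_rate_lower` come from the paragraph above; the `model` key, the flat-list payload shape, the assumption that entries arrive in rank order, and the sample values are all hypothetical, so check them against the real JSON before relying on this.

```python
import json
import math

def wilson_lower_bound(k, n, z=1.96):
    """Wilson score lower bound for k picks in n appearances (assumes n > 0)."""
    p_hat, z2 = k / n, z * z
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (p_hat + z2 / (2 * n) - margin) / (1 + z2 / n)

def audit(entries, tol=1e-3):
    """Recompute both rates per entry and check the published ordering.

    Returns (mismatches, order_ok). Assumes `entries` is in published rank order.
    """
    mismatches = []
    for e in entries:
        if abs(e["k"] / e["n"] - e["win_rate"]) > tol:
            mismatches.append((e["model"], "win_rate"))
        if abs(wilson_lower_bound(e["k"], e["n"]) - e["win_rate_lower"]) > tol:
            mismatches.append((e["model"], "win_rate_lower"))
    resorted = sorted(entries, key=lambda e: wilson_lower_bound(e["k"], e["n"]), reverse=True)
    order_ok = [e["model"] for e in resorted] == [e["model"] for e in entries]
    return mismatches, order_ok

# Hypothetical payload standing in for the downloaded JSON.
payload = json.loads("""[
  {"model": "workhorse", "k": 800, "n": 1000, "win_rate": 0.8,  "win_rate_lower": 0.7741},
  {"model": "steady",    "k": 30,  "n": 40,   "win_rate": 0.75, "win_rate_lower": 0.5981},
  {"model": "newcomer",  "k": 1,   "n": 1,    "win_rate": 1.0,  "win_rate_lower": 0.2065}
]""")
print(audit(payload))  # ([], True)
```

The tolerance is there because published values are usually rounded; tighten or loosen `tol` to match the precision the feed actually uses.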
