How LLM-as-a-judge works (and where it fails)
LLM-as-a-judge is the practice of using one language model to evaluate the answers produced by other language models. It is not magic, and it is not neutral. It is a scalable review method with specific failure modes.
The reason it exists is simple: human evaluation is expensive and slow, while model output is cheap and fast. If you want to compare thousands of answers across many models, you need some way to score or rank them. A judge model can do that at machine speed.
The hard part is remembering what the score means.
The basic loop
A typical LLM-as-a-judge setup has five steps:
- Give the same prompt to several candidate models.
- Collect their answers.
- Show those answers to a judge model.
- Ask the judge to choose, rank, grade, or synthesize.
- Aggregate the judge's decisions across many prompts.
Polymind uses a synthesis-shaped version of this loop. The judge does not merely output a score. It writes a final answer and names which panelists it leaned on. Those named panelists become picks, and the picks feed the leaderboard.
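A minimal sketch of that loop, assuming a judge that returns a final answer plus the panelists it leaned on. The names here (`Synthesis`, `run_panel`, `toy_judge`) are hypothetical and are not Polymind's actual API. The toy judge stands in for a real model call and deliberately picks the longest answer, which also makes it a handy example of a biased judge.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Synthesis:
    """Hypothetical judge output: a final answer plus the panelists it leaned on."""
    final_answer: str
    picks: List[str]


Panelist = Callable[[str], str]
Judge = Callable[[str, Dict[str, str]], Synthesis]


def run_panel(
    prompts: List[str],
    panelists: Dict[str, Panelist],
    judge: Judge,
) -> Dict[str, Tuple[int, int]]:
    """Return {panelist name: (times picked, times shown)} across all prompts."""
    picks: Counter = Counter()
    for prompt in prompts:
        # Steps 1 and 2: same prompt to every panelist, collect the answers.
        answers = {name: ask(prompt) for name, ask in panelists.items()}
        # Steps 3 and 4: the judge sees all answers and synthesizes.
        verdict = judge(prompt, answers)
        # Step 5: aggregate. A panelist counts at most once per prompt.
        picks.update(set(verdict.picks))
    return {name: (picks[name], len(prompts)) for name in panelists}


def toy_judge(prompt: str, answers: Dict[str, str]) -> Synthesis:
    """Stand-in for a real judge model: leans on the longest answer."""
    best = max(answers, key=lambda name: len(answers[name]))
    return Synthesis(final_answer=answers[best], picks=[best])
```

In a real system the panelists and the judge would be model API calls; only the shape of the data (answers in, picks out, tallies per panelist) is the point here.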
This is intentionally different from asking "which answer got the gold label right?" Many real prompts do not have a gold label. They ask for advice, trade-offs, code review, explanation, or creative direction. A judge can compare usefulness where exact-answer scoring does not fit.
Why people use it
LLM-as-a-judge is useful because it scales. Human review might be the gold standard, but it is expensive enough that most products cannot use it for every prompt, every model, every day.
It also handles open-ended answers better than exact-match metrics. If two models write different but plausible explanations, a string comparison cannot tell you which one is better. A judge can reason over clarity, coverage, structure, and relevance.
It can be made repeatable. A fixed judge prompt, fixed aggregation method, and public methodology make the evaluation easier to inspect than a private vibes check.
And it produces a useful artifact: disagreement. If the judge leans on different models for different domains, that becomes a map of model strengths rather than a single global score.
Where it fails
The judge is itself a model, so it has preferences.
It may prefer longer answers. It may reward confident prose. It may like familiar structure. It may underrate concise answers that are actually correct. It may share training data or style with one of the candidate models. It may be swayed more by persuasive framing than by informative content.
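One of these preferences, length, is easy to probe. The sketch below assumes you have logged each judge decision as the set of answers shown plus the label the judge picked. The function name and record shape are hypothetical. It reports how often the pick was also the longest answer, which you can compare against the chance rate (one over the number of answers per prompt).

```python
from typing import Dict, List, Tuple


def longest_pick_rate(decisions: List[Tuple[Dict[str, str], str]]) -> float:
    """Share of decisions where the judge's pick was the longest answer.

    decisions: list of (answers keyed by label, picked label).
    """
    if not decisions:
        return 0.0
    hits = 0
    for answers, picked in decisions:
        longest = max(answers, key=lambda label: len(answers[label]))
        hits += int(picked == longest)
    return hits / len(decisions)
```

A rate well above chance is a hint, not proof: longer answers are sometimes better. It is a reason to read samples, not a verdict.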
It can also miss truth. A judge can choose the answer that sounds more complete even when a shorter answer is correct. It can fail to verify citations. It can reward a hallucination if the hallucination is well-written.
Identity leakage matters too. If the judge knows which model wrote which answer, it may import brand expectations into the decision. Blind judging helps, but not every system does it, and style can still reveal identity.
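A minimal sketch of blind judging, assuming answers arrive as a dict keyed by model name. The helper name `blind_answers` and the "Answer N" labels are illustrative choices, not a description of any particular system. The judge only ever sees the neutral labels, and the key is kept aside to map picks back to models afterwards.

```python
import random
from typing import Dict, Tuple


def blind_answers(
    answers: Dict[str, str], rng: random.Random
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Shuffle answers and replace model names with neutral labels.

    Returns (blinded answers keyed by label, key mapping label -> model name).
    """
    names = list(answers)
    rng.shuffle(names)  # random order, so position does not reveal the model
    key = {f"Answer {i + 1}": name for i, name in enumerate(names)}
    blinded = {label: answers[name] for label, name in key.items()}
    return blinded, key
```

This hides names and ordering only. It does nothing about style, so a judge may still recognize a model from its phrasing or formatting.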
Finally, aggregation can hide variance. A model that is excellent in code and weak in medical questions may look average overall. That is why per-domain leaderboards matter.
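To see how a single number flattens this, here is a small sketch that computes pick rates both overall and per domain. It assumes each record is a `(domain, model, picked)` tuple, which is a hypothetical log format chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pick_rates(
    records: Iterable[Tuple[str, str, bool]]
) -> Tuple[Dict[str, float], Dict[Tuple[str, str], float]]:
    """Return (overall rate per model, rate per (domain, model))."""
    overall = defaultdict(lambda: [0, 0])    # model -> [picked, shown]
    by_domain = defaultdict(lambda: [0, 0])  # (domain, model) -> [picked, shown]
    for domain, model, picked in records:
        for bucket in (overall[model], by_domain[(domain, model)]):
            bucket[0] += int(picked)
            bucket[1] += 1
    overall_rates = {m: p / n for m, (p, n) in overall.items()}
    domain_rates = {k: p / n for k, (p, n) in by_domain.items()}
    return overall_rates, domain_rates
```

A model picked 9 times out of 10 in code and 1 time out of 10 in medical questions comes out at exactly 0.5 overall, which describes neither domain.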
How to make it less bad
First, publish the method. A leaderboard that does not tell you who the judge is, what it sees, what it outputs, and how scores are aggregated is asking for trust it has not earned.
Second, show sample sizes. Judge decisions are noisy. A model picked once in a single run has a perfect raw win rate, but it is not a champion. This is why Polymind sorts by the Wilson lower bound instead of raw win rate: the lower bound stays low until enough samples back the rate up. The Wilson explainer covers the math.
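For reference, a minimal sketch of the standard Wilson score lower bound. The default `z = 1.96` (roughly a 95% interval) and the zero-sample fallback are assumptions for illustration; the exact parameters Polymind uses are not stated here.

```python
import math


def wilson_lower_bound(wins: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a win rate of wins / n."""
    if n == 0:
        return 0.0  # assumption: no data ranks at the bottom
    p = wins / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom
```

With these defaults, a model picked 1 time out of 1 scores about 0.21, while a model picked 80 times out of 100 scores about 0.71. Raw win rate would rank the first model higher; the lower bound does not.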
Third, split by domain. A single "best model" number flattens the things users actually care about. The useful question is often "best for code," "best for legal research," or "best for creative writing."
Fourth, keep examples inspectable. The aggregate tells you the pattern. Individual debate pages tell you what the judge was seeing.
Fifth, name the limits. LLM-as-a-judge measures judge preference over candidate answers. It is not the same thing as truth, safety, or human satisfaction.
How to read a judge-based leaderboard
Ask five questions:
- What did every model see?
- What did the judge see?
- Was the judge blind to model identity?
- How many samples support the rank?
- What biases does the publisher admit?
If those answers are visible, LLM-as-a-judge can be a useful evaluation tool. If they are hidden, the leaderboard is mostly a confidence machine.
Polymind's methodology page exists for that reason: not because the method is perfect, but because an imperfect method should be inspectable.