The words Polymind uses, defined.
Most of these terms have ordinary English meanings too — but on Polymind they carry specific definitions. This page is the reference the methodology, leaderboard, and about pages all draw from.
- Appearance
A run in which a given panelist was on the panel — counted whether or not the judge later picked that panelist.
Appearances are the denominator behind every win-rate number on the leaderboard. A model that sits on every panel but never gets picked still accumulates appearances, which drags its win rate toward zero; conversely, the Wilson lower bound keeps a perfect record over only a handful of appearances from looking flattering.
Related: Pick, Wilson lower bound, Win rate
- Consensus
The degree to which Polymind's panelists agree on an answer for a given run.
Consensus is read primarily from the judge's own count of how many panelists aligned with the position it adopted, with a lexical overlap score as a fallback, and surfaces in the shared debate view. High consensus suggests low question-level controversy; low consensus is the headline of every dissent debate. Consensus is descriptive, not prescriptive — a confident consensus on a wrong answer is still wrong.
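A minimal sketch of that two-tier reading, under loud assumptions: the source does not say which lexical overlap measure is used, so mean pairwise Jaccard overlap on lowercased word sets stands in for it here, and the names `consensus`, `jaccard`, and `aligned_count` are hypothetical.

```python
from itertools import combinations
from typing import Optional, Sequence


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two answers (an assumed stand-in measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty answers are trivially identical
    return len(ta & tb) / len(ta | tb)


def consensus(answers: Sequence[str], aligned_count: Optional[int] = None) -> float:
    """Prefer the judge's own count; fall back to lexical overlap."""
    if aligned_count is not None and answers:
        # Primary signal: share of panelists the judge says aligned with it.
        return aligned_count / len(answers)
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # a single answer cannot disagree with itself
    # Fallback: average overlap across every pair of panelist answers.
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Either branch returns a value between 0 and 1, so a display layer could treat the two sources interchangeably.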
- Critique round (also: debate depth, depth)
A pass where every panelist sees the other panelists' answers and revises its own.
Depth zero means no critique — every panelist answers once and the judge synthesizes immediately. Depth one through three runs that many critique passes before synthesis. Higher depth surfaces more disagreement but multiplies cost: each added round has every panelist re-read the whole panel, so a deep run across a full panel costs many times the tokens of a depth-zero run.
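As a rough illustration of why depth multiplies cost, the toy model below counts tokens under stated assumptions: in round zero every panelist reads the prompt and writes one answer; in each critique round every panelist re-reads the prompt plus the whole panel's answers and writes a revision; the judge reads the prompt and all final answers once. The function name, the flat per-answer length, and the default token counts are illustrative, not Polymind's actual accounting.

```python
def estimate_run_tokens(
    panel_size: int,
    depth: int,
    prompt_tokens: int = 100,   # assumed prompt length
    answer_tokens: int = 500,   # assumed length of every answer
) -> int:
    """Estimate total tokens (input + output) for one run in a toy cost model."""
    # Round zero: each panelist reads the prompt and writes one answer.
    round_zero = panel_size * (prompt_tokens + answer_tokens)
    # One critique round: each panelist reads the prompt and the whole panel,
    # then writes a revised answer.
    per_critique = panel_size * (
        prompt_tokens + panel_size * answer_tokens + answer_tokens
    )
    # Synthesis: the judge reads the prompt and every final answer, writes one.
    judge = prompt_tokens + panel_size * answer_tokens + answer_tokens
    return round_zero + depth * per_critique + judge
```

Under these assumptions a seven-panelist run comes to 8,300 tokens at depth zero and 94,400 at depth three, roughly eleven times as much.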
- Dissent (also: disagreement)
The condition where a panelist's final answer materially diverges from the consensus of the rest of the panel.
Dissent is the signal Polymind exists to surface — a single confident outlier on a question every other model agreed on is the kind of thing worth reading. Dissent is tracked per-run and used to pick the headlines on the upcoming disagreements feed.
- Domain
The category a prompt was classified into at submission time — one of code, legal, medical, creative, research, or general.
Classification is a separate cheap model call at submission time, capped at 8 output tokens and a 3-second timeout. The 'general' tag is the fallback for anything the classifier was unsure about. Domain tags drive the per-domain leaderboard slices and let a reader compare model rankings on the subset of prompts they care about.
Related: Pick, Wilson lower bound
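The classification step might look something like the sketch below. It assumes a hypothetical async `classifier` callable that takes the prompt and an output-token cap and returns a raw label; that callable is a stand-in, not a real provider API. Treating timeouts, provider errors, and unrecognized labels as "unsure" is also an assumption.

```python
import asyncio
from typing import Awaitable, Callable

DOMAINS = ("code", "legal", "medical", "creative", "research", "general")

# Hypothetical classifier signature: (prompt, max_output_tokens) -> raw label.
Classifier = Callable[[str, int], Awaitable[str]]


async def classify_domain(
    prompt: str,
    classifier: Classifier,
    timeout_s: float = 3.0,       # 3-second timeout from the definition above
    max_output_tokens: int = 8,   # 8-token output cap from the definition above
) -> str:
    try:
        raw = await asyncio.wait_for(
            classifier(prompt, max_output_tokens), timeout=timeout_s
        )
    except Exception:
        # Assumption: a timeout or provider error falls back to 'general'.
        return "general"
    label = raw.strip().lower()
    # Anything outside the six known tags also falls back to 'general'.
    return label if label in DOMAINS else "general"
```

Keeping the fallback inside the function means a slow or confused classifier can never block submission; the prompt simply lands in the 'general' slice.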
- Fan-out
The first step of a Polymind run: sending one prompt to every enabled panelist in parallel.
Fan-out is the architectural reason Polymind exists. The same prompt streams to every panelist's API simultaneously over SSE; a per-provider stall timeout prevents one wedged stream from holding up the rest. The orchestrator only waits as long as the slowest healthy panelist takes — never as long as the slowest unhealthy one.
Related: Critique round, Panelist
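The fan-out pattern described above can be sketched with `asyncio`. In this illustration in-process async iterators stand in for the SSE streams, and the names `collect_stream` and `fan_out`, along with the default timeout value, are hypothetical. The stall timeout applies to the gap between chunks, not to the total response time.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, Optional


async def collect_stream(stream: AsyncIterator[str], stall_timeout_s: float) -> str:
    """Join a stream's chunks; raise a timeout error if it goes quiet too long."""
    chunks = []
    iterator = stream.__aiter__()
    while True:
        try:
            # The timeout restarts for every chunk, so a slow but steadily
            # streaming panelist is never cut off.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=stall_timeout_s)
        except StopAsyncIteration:
            return "".join(chunks)
        chunks.append(chunk)


async def fan_out(
    prompt: str,
    panelists: Dict[str, Callable[[str], AsyncIterator[str]]],
    stall_timeout_s: float = 10.0,  # assumed value; the source gives none
) -> Dict[str, Optional[str]]:
    """Send one prompt to every panelist in parallel; stalled ones yield None."""
    names = list(panelists)
    results = await asyncio.gather(
        *(collect_stream(panelists[name](prompt), stall_timeout_s) for name in names),
        return_exceptions=True,  # one failure must not cancel the others
    )
    return {
        name: (result if isinstance(result, str) else None)
        for name, result in zip(names, results)
    }
```

Because each stream carries its own stall timer, the total wait is bounded by the slowest healthy panelist, while a wedged stream is dropped after at most one stall timeout of silence.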
- Judge
The model that reads every panelist's final answer and produces the synthesized response shown to the user.
The judge is chosen by the user, not assigned. Its synthesis is not a vote — it's an opinionated take that may agree with one panelist, blend three, or contradict all of them. As part of its synthesis the judge names the one to three panelists it leaned on most; those names become the picks behind every leaderboard ranking.
Related: Pick, Synthesis, LLM-as-judge
- LLM council (also: AI council, model council, AI advisory board, AI panel)
A setup where one question goes to several AI models at once, which then review each other before a chosen model synthesizes a single final answer.
It's the pattern Polymind is built on. The council members are the panelists; the model that weighs their answers and writes the final one is the judge. The idea is that a panel that critiques itself catches mistakes a single model would state with equal confidence — and the consensus score plus dissent callout tell you, at a glance, whether the council actually agreed.
Related: Panelist, Judge, Consensus, Critique round
- LLM-as-judge (also: LLM-as-a-judge)
The methodology of evaluating model outputs by having another large language model assess them.
Polymind's leaderboard is built on this method: in every run a judge model, rather than human raters or an automated benchmark, picks the panelists. The trade-off is bias: judges have their own preferences (style, length, confidence). In exchange, the method scales to far more runs than human rating can, and the Wilson lower bound corrects for the sample-size noise that plagues smaller human-rated leaderboards.
Related: Judge, Wilson lower bound
- Methodology version
A monotonically increasing integer, stamped on every leaderboard page, that identifies which version of the aggregation math produced the current numbers.
The version is bumped whenever the aggregation logic, the definition of a contributed run, or the Wilson z-score changes. Citations to the leaderboard should include the methodology version so a reader can pin which math the cited number was computed with.
Related: Wilson lower bound
- Panelist
One of the LLMs participating in a Polymind run — Claude, GPT, Gemini, Perplexity, Grok, Mistral, or Qwen at present.
Each provider is a panelist; the specific model per panelist depends on the user's tier (free tier pins each to a cheaper model, premium tier opens the full menu). Every panelist sees the same prompt and answers independently in round zero, then sees the others' answers in any critique rounds.
Related: Fan-out, Judge, LLM council
- Pick (also: judge pick)
A nomination by the judge naming one of the panelists it leaned on most when writing its synthesis.
Picks come from the judge, not from users. Every Polymind run ends with the judge appending a private marker listing the one to three panelists it relied on. Each named panelist accrues one pick for that run; the judge does not pick itself. Picks are the numerator behind every leaderboard win-rate column.
Related: Judge, Appearance, Win rate
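To make the bookkeeping concrete, here is a small sketch of how picks and appearances could be tallied from run records. The `Run` record shape and the `tally` function are hypothetical. It also assumes the judge is identified by the same name as its panelist, and that a judge's own panelist still counts as an appearance, which the definitions above do not settle.

```python
from collections import Counter
from typing import Dict, Iterable, List, TypedDict


class Run(TypedDict):
    panel: List[str]   # panelists that took part in the run
    judge: str         # judge chosen by the user
    picks: List[str]   # panelists named in the judge's private marker


def tally(runs: Iterable[Run]) -> Dict[str, Dict[str, float]]:
    appearances: Counter = Counter()
    picks: Counter = Counter()
    for run in runs:
        panel = set(run["panel"])
        # Every panelist on the panel gets an appearance, picked or not.
        appearances.update(panel)
        # Deduplicate, drop names not on the panel, and drop self-picks.
        named = [
            name for name in dict.fromkeys(run["picks"])
            if name in panel and name != run["judge"]
        ]
        # At most three picks per run, one pick per named panelist.
        picks.update(named[:3])
    return {
        name: {
            "appearances": appearances[name],
            "picks": picks[name],
            "win_rate": picks[name] / appearances[name],
        }
        for name in appearances
    }
```

Picks feed the numerator and appearances the denominator, so a panelist that is never named still shows up in the output with a win rate of zero.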
- Sample-size floor
The minimum number of appearances a panelist needs before its leaderboard row renders at full opacity rather than faded.
Currently set to 10 appearances. The Wilson lower bound already corrects the ranking math, but visual fading is the additional signal to readers not to over-interpret a panelist with three appearances sitting above one with eighty. The floor will lift once total runs cross into the thousands.
Related: Wilson lower bound, Appearance
- Synthesis
The judge's final response to the user — a single opinionated answer composed from the panelists' final-round answers.
The synthesis is the artifact the user sees as 'the answer.' Panelist answers are still rendered below it for readers who want to inspect the reasoning, but the synthesis is the headline. Each shared `/d/{slug}` page treats the synthesis as the accepted answer in its QAPage schema.
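For readers unfamiliar with QAPage markup, the sketch below builds a schema.org JSON-LD document of that shape. The function name and arguments are hypothetical, and listing panelist answers under `suggestedAnswer` is an assumption; the definition above only says the synthesis is the accepted answer.

```python
import json
from typing import Dict


def qa_page_jsonld(
    slug: str, question: str, synthesis: str, panelist_answers: Dict[str, str]
) -> str:
    """Build QAPage JSON-LD with the judge's synthesis as the accepted answer."""
    doc = {
        "@context": "https://schema.org",
        "@type": "QAPage",
        "mainEntity": {
            "@type": "Question",
            "name": question,
            "answerCount": 1 + len(panelist_answers),
            # The synthesis is the headline answer for the shared page.
            "acceptedAnswer": {
                "@type": "Answer",
                "text": synthesis,
                "url": f"/d/{slug}",
            },
            # Assumption: panelist answers are exposed as suggested answers.
            "suggestedAnswer": [
                {"@type": "Answer", "text": text}
                for text in panelist_answers.values()
            ],
        },
    }
    return json.dumps(doc, indent=2)
```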
- Wilson lower bound (also: Wilson confidence interval, Wilson 95% lower bound)
A correction to a raw success rate that asks: given the picks observed, what's the lowest the true win rate could plausibly be at 95% confidence?
Polymind ranks panelists by Wilson lower bound rather than raw win rate so a brand-new 1/1 model can't outrank an 80/100 model just by being new. It is the same correction Reddit uses for its 'best' comment sort. A 1/1 panelist has a Wilson lower bound near 0.21, not 1.0; an 80/100 panelist sits near 0.71.
Related: Pick, Appearance, Sample-size floor, Win rate
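For reference, the standard Wilson score lower bound for k picks in n appearances is written out below. The symbols are notation introduced here, and z ≈ 1.96 (two-sided 95%) is an assumption; it is the value that reproduces the 0.21 and 0.71 figures quoted above.

```latex
\hat{p} = \frac{k}{n}, \qquad
\mathrm{WLB}(k, n) =
  \frac{\hat{p} + \dfrac{z^2}{2n}
        - z \sqrt{\dfrac{\hat{p}\,(1 - \hat{p})}{n} + \dfrac{z^2}{4n^2}}}
       {1 + \dfrac{z^2}{n}}

% Worked example for a 1/1 panelist (k = 1, n = 1, z = 1.96):
\mathrm{WLB}(1, 1)
  = \frac{1 + 1.9208 - 1.96\sqrt{0 + 0.9604}}{1 + 3.8416}
  = \frac{1}{4.8416} \approx 0.207
```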
- Win rate (also: pick rate)
Picks divided by appearances — the share of a panelist's appearances in which the judge picked them.
Win rate is reported on every leaderboard row but is not the ranking key. Two panelists with identical raw win rates can have very different Wilson lower bounds (a 4/5 model and an 80/100 model both have a raw win rate of 0.80, but their lower bounds are roughly 0.38 and 0.71, a gap of about 0.34). Read both numbers, but trust the rank order.
Related: Wilson lower bound, Pick, Appearance
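A short sketch of ranking by Wilson lower bound instead of raw win rate follows. It uses the standard Wilson score formula with z = 1.96 as an assumed value for 95% confidence; the function names and the `(name, picks, appearances)` row shape are illustrative, not Polymind's actual code.

```python
import math
from typing import List, Tuple


def wilson_lower_bound(picks: int, appearances: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for picks / appearances."""
    if appearances == 0:
        return 0.0  # no data: assume the most pessimistic bound
    p = picks / appearances
    z2 = z * z
    centre = p + z2 / (2 * appearances)
    margin = z * math.sqrt(p * (1 - p) / appearances + z2 / (4 * appearances**2))
    return (centre - margin) / (1 + z2 / appearances)


def rank(rows: List[Tuple[str, int, int]]) -> List[str]:
    """Order (name, picks, appearances) rows by Wilson lower bound, best first."""
    ordered = sorted(
        rows, key=lambda row: wilson_lower_bound(row[1], row[2]), reverse=True
    )
    return [name for name, _, _ in ordered]
```

Given a 1/1 newcomer, a 4/5 panelist, and an 80/100 panelist, raw win rate would put the newcomer first, while this ordering puts the 80/100 panelist first and the newcomer last.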