Blog

Notes on multi-model AI, from the data up.

Leaderboard reads, methodology deep-dives, and reactive write-ups on new frontier models. Also available as an RSS feed.

2026-05-27 · 4 min read · llm-council · multi-model · product
Karpathy's LLM council, without the setup: a hosted multi-model panel
Andrej Karpathy's llm-council put the multi-model 'council' pattern on the map — but it's a local repo you clone, configure, and run with your own OpenRouter key. Here's what it does, what running it costs you in setup, and how to use the same idea hosted.
Read post
2026-05-27 · 5 min read · llm-council · multi-model · consensus
What is an LLM council? How asking many AIs at once beats asking one
An LLM council asks the same question to several AI models, has them review each other, then lets one model synthesize the best answer. Here's how the pattern works, where the term came from, and when a council beats a single chatbot.
Read post
2026-05-24 · 4 min read · consensus · multi-model · methodology
What is consensus in AI? A taxonomy
AI consensus is not one thing. Models can agree on the final answer, the reasoning path, the uncertainty, or only the next step. Knowing which kind of consensus you have changes how much to trust it.
Read post
2026-05-24 · 4 min read · code · leaderboard · methodology
Best AI for code in 2026
The best AI for code is the one that survives real implementation prompts, not the one that wins a single demo. Use live code-domain rankings, sample sizes, and side-by-side review before trusting any coding model.
Read post
2026-05-24 · 4 min read · creative · leaderboard · consensus
Best AI for creative writing in 2026
The best AI for creative writing is not the model with the prettiest first draft. It is the one that can hold voice, revise toward intent, and make useful trade-offs when multiple good answers exist.
Read post
2026-05-24 · 6 min read · legal · leaderboard · methodology
Best AI for legal research in 2026 (data-driven)
The best AI for legal research is not the model with the loudest demo. It is the model that wins repeatably on legal-style prompts, with sample size visible, judge bias named, and caveats kept close to the number.
Read post
2026-05-24 · 3 min read · medical · leaderboard · methodology
Best AI for medical questions in 2026 (with caveats)
The best AI for medical questions is the one that is useful without pretending to be a clinician. Read live medical-domain rankings as a research signal, not as diagnosis, treatment, or emergency guidance.
Read post
2026-05-24 · 3 min read · research · leaderboard · methodology
Best AI for research in 2026
The best AI for research is the model that can map a question, expose uncertainty, and help you verify sources. Use live research-domain rankings as a shortlist, then make the evidence do the final work.
Read post
2026-05-24 · 4 min read · product · consensus · methodology
Cost vs consensus: does paying for premium AI actually improve answers?
Premium AI only improves an answer when the extra model quality changes the bottleneck. For hard prompts, consensus and disagreement can matter more than a single expensive model.
Read post
2026-05-24 · 3 min read · methodology · llm-as-judge · leaderboard
How LLM-as-a-judge works (and where it fails)
LLM-as-a-judge turns model evaluation into a structured comparison: show a judge the candidate answers, ask for a decision, and aggregate many decisions. It is useful, but only when its biases are visible.
Read post
2026-05-24 · 3 min read · engineering · multi-model · product
How we built a 6-model AI debate engine
Polymind is a fan-out engine: one prompt goes to several providers, optional critique rounds let models revise, and a judge synthesizes the final answer. The hard parts are orchestration, streaming, persistence, and trust boundaries.
Read post
2026-05-24 · 2 min read · reference · media · leaderboard
Polymind for journalists: a one-page reference
A concise reference for journalists covering Polymind, multi-model AI, and judge-pick leaderboards: what the product does, what the leaderboard means, and what not to overclaim.
Read post
2026-05-24 · 2 min read · research · citation · dataset
Polymind for researchers: how to cite our data
Polymind publishes live leaderboard pages and machine-readable data so researchers can cite the method, inspect the numbers, and avoid treating a changing leaderboard as a timeless claim.
Read post
2026-05-24 · 3 min read · product · multi-model · comparison
Polymind vs ChatGPT vs Perplexity vs You.com: why query one when six are cheap?
ChatGPT, Perplexity, and You.com are strong single-entry AI interfaces. Polymind is different: it asks several models at once, shows disagreement, and turns repeated judge preference into a public leaderboard.
Read post
2026-05-21 · 6 min read · methodology · leaderboard · multi-model
How to honestly compare LLMs: a methodology
Public benchmarks saturate, leaderboards get gamed, and 'I tried it on my own question' is anecdote. Here is the rough shape of an honest comparison — same prompt, parallel models, post-hoc judging, enough samples to mean anything — and how Polymind tries to do it.
Read post
2026-05-21 · 5 min read · multi-model · consensus · product
Multi-model AI: why one chatbot isn't enough
The single-chatbot habit hides how often frontier models disagree. Querying many in parallel turns that disagreement into the most useful thing on the screen — and consensus, when it happens, into something you can actually rely on.
Read post
2026-05-21 · 4 min read · leaderboard · methodology · multi-model
Which AI wins most often in 2026? Polymind's open leaderboard
Polymind runs frontier LLMs side-by-side, then asks a judge model which answer it leaned on. Six months of those judge picks, ranked with a Wilson lower bound, give an opinion on which AI actually wins — not on a one-shot benchmark, but across a growing body of real questions.
Read post
2026-05-21 · 7 min read · methodology · leaderboard · statistics
Wilson lower bound vs raw win rate: why LMSYS and Polymind sort this way
Ranking models by raw win rate puts the noisiest, smallest-sample contenders at the top. The Wilson lower bound is the standard fix: the same trick Reddit uses to rank comments, and close in spirit to the confidence intervals LMSYS reports around its Elo-style ratings. Here's the math, a worked example, and where the fix breaks.
Read post
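
For a quick feel of the Wilson lower bound idea in the entry above, here is a minimal sketch of the textbook formula in Python. The model names and win tallies are hypothetical, chosen only to show a small-sample leader dropping below a large-sample one. The function name, the 95% z-value, and the zero-sample handling are assumptions made for illustration, not Polymind's actual implementation.

```python
import math


def wilson_lower_bound(wins: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a win proportion.

    z = 1.96 corresponds to a two-sided 95% interval (an assumed default).
    """
    if total == 0:
        return 0.0  # no evidence yet: rank at the bottom (illustrative choice)
    p_hat = wins / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


# Hypothetical tallies: (wins, comparisons judged)
models = {
    "model_a": (9, 10),       # 90% raw win rate, tiny sample
    "model_b": (780, 1000),   # 78% raw win rate, large sample
}

ranked_raw = sorted(
    models, key=lambda m: models[m][0] / models[m][1], reverse=True
)
ranked_wilson = sorted(
    models, key=lambda m: wilson_lower_bound(*models[m]), reverse=True
)

print("raw win rate order:", ranked_raw)
print("Wilson lower bound order:", ranked_wilson)
for name, (wins, total) in models.items():
    print(f"{name}: raw={wins / total:.3f} wilson={wilson_lower_bound(wins, total):.3f}")
```

With these made-up numbers, raw win rate puts model_a first, while the Wilson lower bound puts model_b first: ten comparisons are too few to support a 90% claim, so model_a's lower bound falls to roughly 0.60 against roughly 0.75 for model_b.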