
Blog

Notes on multi-model AI, from the data up.

Leaderboard reads, methodology deep-dives, and reactive write-ups on new frontier models. Also available as an RSS feed.

  1. 4 min read · llm-council · multi-model · product

    Karpathy's LLM council, without the setup: a hosted multi-model panel

    Andrej Karpathy's llm-council put the multi-model 'council' pattern on the map — but it's a local repo you clone, configure, and run with your own OpenRouter key. Here's what it does, what running it costs you in setup, and how to use the same idea hosted.

  2. 4 min read · consensus · multi-model · methodology

    What is consensus in AI? A taxonomy

    AI consensus is not one thing. Models can agree on the final answer, the reasoning path, the uncertainty, or only the next step. Knowing which kind of consensus you have changes how much to trust it.

  3. 4 min read · code · leaderboard · methodology

    Best AI for code in 2026

    The best AI for code is the one that survives real implementation prompts, not the one that wins a single demo. Use live code-domain rankings, sample sizes, and side-by-side review before trusting any coding model.

  4. 4 min read · creative · leaderboard · consensus

    Best AI for creative writing in 2026

    The best AI for creative writing is not the model with the prettiest first draft. It is the one that can hold voice, revise toward intent, and make useful trade-offs when multiple good answers exist.

  5. 6 min read · legal · leaderboard · methodology

    Best AI for legal research in 2026 (data-driven)

    The best AI for legal research is not the model with the loudest demo. It is the model that wins repeatably on legal-style prompts, with sample size visible, judge bias named, and caveats kept close to the number.

  6. 3 min read · medical · leaderboard · methodology

    Best AI for medical questions in 2026 (with caveats)

    The best AI for medical questions is the one that is useful without pretending to be a clinician. Read live medical-domain rankings as a research signal, not as diagnosis, treatment, or emergency guidance.

  7. 3 min read · research · leaderboard · methodology

    Best AI for research in 2026

    The best AI for research is the model that can map a question, expose uncertainty, and help you verify sources. Use live research-domain rankings as a shortlist, then make the evidence do the final work.

  8. 3 min read · methodology · llm-as-judge · leaderboard

    How LLM-as-a-judge works (and where it fails)

    LLM-as-a-judge turns model evaluation into a structured comparison: show a judge the candidate answers, ask for a decision, and aggregate many decisions. It is useful, but only when its biases are visible.

  9. 3 min read · engineering · multi-model · product

    How we built a 6-model AI debate engine

    Polymind is a fan-out engine: one prompt goes to several providers, optional critique rounds let models revise, and a judge synthesizes the final answer. The hard parts are orchestration, streaming, persistence, and trust boundaries.

  10. 2 min read · reference · media · leaderboard

    Polymind for journalists: a one-page reference

    A concise reference for journalists covering Polymind, multi-model AI, and judge-pick leaderboards: what the product does, what the leaderboard means, and what not to overclaim.

  11. 2 min read · research · citation · dataset

    Polymind for researchers: how to cite our data

    Polymind publishes live leaderboard pages and machine-readable data so researchers can cite the method, inspect the numbers, and avoid treating a changing leaderboard as a timeless claim.

  12. 6 min read · methodology · leaderboard · multi-model

    How to honestly compare LLMs: a methodology

    Public benchmarks saturate, leaderboards get gamed, and 'I tried it on my own question' is anecdote. Here is the rough shape of an honest comparison — same prompt, parallel models, post-hoc judging, enough samples to mean anything — and how Polymind tries to do it.

  13. 5 min read · multi-model · consensus · product

    Multi-model AI: why one chatbot isn't enough

    The single-chatbot habit hides how often frontier models disagree. Querying many in parallel turns that disagreement into the most useful thing on the screen — and consensus, when it happens, into something you can actually rely on.

  14. 4 min read · leaderboard · methodology · multi-model

    Which AI wins most often in 2026? Polymind's open leaderboard

    Polymind runs frontier LLMs side-by-side, then asks a judge model which answer it leaned on. Six months of those judge picks, ranked with a Wilson lower bound, give an opinion on which AI actually wins — not on a one-shot benchmark, but across a growing body of real questions.

  15. 7 min read · methodology · leaderboard · statistics

    Wilson lower bound vs raw win rate: why LMSYS and Polymind sort this way

    Ranking models by raw win rate puts the noisiest, smallest-sample contenders at the top. The Wilson lower bound is the standard fix: the same trick Reddit uses for comments and LMSYS uses for Elo intervals. Here's the math, a worked example, and where the fix breaks.
