Chain of Recursive Thoughts: Making an AI Argue With Itself
In May 2025 I published Chain of Recursive Thoughts (CoRT) — a small experiment in making a language model improve its own answers by generating competitors and judging them. It went from ~20 stars to over 2,300 within days of hitting #2 on Hacker News, where 185+ comments argued about whether it's a real technique or an elaborate way of saying "try again." This post is the honest version of both sides.
How the loop works
The core is around 200 lines of Python and plain prompting — no fine-tuning, no tool use, nothing exotic:
- Draft. The model produces an initial answer to the prompt.
- Decide depth. The model itself estimates how many rounds of rethinking the problem deserves.
- Compete. Each round, it generates three alternative answers, evaluates all candidates against the current best, and keeps the winner.
- Repeat until the rounds run out, then return whatever survived.
The framing that matters: it's not "critique and revise," it's a small tournament. Alternatives compete against the incumbent every round, and an answer only survives by repeatedly beating fresh challengers.
Where it helps
The gains are most visible on smaller models. Testing with Mistral 3.1 24B showed the biggest jumps on programming tasks — the README has side-by-side outputs. Intuitively that tracks: a small model's first draft leaves the most room for a competitive pass to recover, while a frontier model's first draft is already near its ceiling.
The repo grew a web UI and OpenRouter support after the HN wave, so you can point it at basically any hosted model.
What Hacker News argued about
The thread split into three camps, and all of them had a point:
- The enthusiasts — people running similar loops of their own: same model at different temperatures, panels of "personas" with different system prompts, council structures, adversarial debates. For them CoRT was validation that the pattern works.
- The accountants — the obvious objection: every round multiplies token cost. A technique that 4x's your bill for a quality bump needs the bump to be worth it, and that's use-case dependent.
- The skeptics — "techniques like this have been around since GPT-3.5": it's Chain-of-Thought plus self-consistency plus ensembling, repackaged. And the deeper question: does an LLM judging its own outputs actually improve reasoning, or does it just select for plausible-sounding answers?
My take hasn't changed since the repo description: I'll let you decide. It's 200 lines — the cheapest possible way to find out is to run it on your own workload.
FAQ
- Is CoRT the same as Chain-of-Thought?
- No. Chain-of-Thought makes a model show intermediate reasoning within one answer. CoRT generates multiple complete answers and makes them compete across rounds, keeping the winner.
- Does it work with any model?
- It's plain prompting, so anything with an API works — the repo supports OpenRouter, which covers most hosted models. Gains are largest on smaller models.
- What does it cost?
- Each round is three generations plus an evaluation, so expect a multiple of single-shot token cost proportional to the thinking depth the model chooses.