Review.AI
How we review · Methodology v2.4 · Apr 2026

We test every tool the same way, so you can trust the comparison.

Testing · human hand, not a scraper
Prompts · drafted from real use cases
Trials · lived-with, not demoed
Graders · humans + AI panel
Verdict · one clear call, not a listicle
A score is only as good as the method behind it. This page is that method — every step, every weight, everything we won't do. If you disagree with a verdict, this is where to start the argument.
Contents
01 › Category scoping
02 › Prompt battery
03 › Real workflows
04 › Human grading
05 › Web signals
06 › The RAI Framework
07 › Scoring math
08 › Publishing
01

Scope the category.

Owner · Editorial
Cadence · Quarterly
Output · Category brief + prompt battery

Before any tool is tested, the editors write a category brief: what the tools in this category are actually trying to do, what a good outcome looks like, and what the category edge cases are. The AI coding assistant brief looks nothing like the voice AI brief — and that's the point.

The brief defines the prompt battery: real-world use cases the graders have actually hit in their own work — engineering, writing, research. Every task is drafted from something someone needed to do for real, then tightened to stress a specific dimension. We retire prompts as tools start to ace them and add new ones as the category moves.
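As a sketch, one battery entry can be modeled as a small record. The field names here (`prompt_id`, `tests`, `pass_criteria`, `retired`) are illustrative, mirroring the sample rows in this section rather than Review.AI's internal schema:

```python
from dataclasses import dataclass

@dataclass
class BatteryPrompt:
    """One entry in a category's prompt battery (illustrative schema)."""
    prompt_id: str            # e.g. "cod_02"
    prompt: str               # full task text, drafted from a real use case
    tests: str                # the dimension the prompt is tightened to stress
    pass_criteria: list[str]  # every criterion must hold for a pass
    retired: bool = False     # flipped once most tools start to ace it

battery = [
    BatteryPrompt(
        "cod_02",
        "Trace and fix a race condition in this Go service.",
        "Debugging",
        ["root cause named", "fix is correct", "explains tradeoff"],
    ),
]

# Retiring a prompt is then a flag flip, not a deletion — the log history stays.
active = [p for p in battery if not p.retired]
```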

02

Run the prompt battery.

Owner · Reviewers
Per tool · Run twice, independently
Determinism · temp, seed, prompt fixed

Every tool in a category faces the same prompt set. Same wording. Same seed where supported. Two reviewers run them independently to catch flukes. Outputs go straight into the graded log.

ID | Prompt (abbrev.) | Tests | Pass criteria
cod_01 | Refactor a 420-line Python file into typed modules. | Reasoning | Types correct · tests pass · no new bugs
cod_02 | Trace and fix a race condition in this Go service. | Debugging | Root cause named · fix is correct · explains tradeoff
cod_07 | Migrate a schema from Postgres → PlanetScale with zero downtime. | Planning | Order of ops sound · rollback plan exists
cod_12 | Add OAuth2 to this Rails app without breaking existing sessions. | Multi-file | Edits 4+ files correctly · migration reversible
cod_18 | Summarize the test failures in this CI log and suggest next steps. | Triage | Right failures identified · actionable next step
cod_24 | Write a Terraform module that satisfies [policy.yaml] constraints. | Synthesis | All 11 constraints met · no extra resources
Sample prompts shown above. The full battery is kept internal for now — every prompt, temperature, seed, and pass criterion is logged against the same rubric for every tool. We'll open up the battery and its logs category by category as each one stabilises, so readers can inspect and reproduce what we ran.

Sample log entry — cod_02 · Claude Code
{
  "prompt_id": "cod_02",
  "tool": "claude-code-0.18",
  "temperature": 0.2,
  "seed": 42,
  "latency_ms": 8420,
  "tokens_out": 1142,
  "pass_criteria": {
    "root_cause_named": true,
    "fix_correct": true,
    "explains_tradeoff": true
  },
  "grader_1": 9, // Jordan
  "grader_2": 9, // Priya
  "grader_ai": 8
}
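A minimal sketch of how a graded log entry could be checked. The dict mirrors the sample entry above (with the reviewer-name annotations folded into comments); the function names are illustrative, not Review.AI's internal API:

```python
entry = {
    "prompt_id": "cod_02",
    "tool": "claude-code-0.18",
    "pass_criteria": {
        "root_cause_named": True,
        "fix_correct": True,
        "explains_tradeoff": True,
    },
    "grader_1": 9,   # Jordan
    "grader_2": 9,   # Priya
    "grader_ai": 8,  # calibration only — never part of the published mean
}

def passed(e: dict) -> bool:
    # A prompt passes only if every pass criterion holds.
    return all(e["pass_criteria"].values())

def human_mean(e: dict) -> float:
    # The published score is the mean of the two human graders.
    return (e["grader_1"] + e["grader_2"]) / 2
```

Keeping `grader_ai` in the log but out of `human_mean` is what "calibration check, not a vote" means in practice.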
03

Use the tool for real work.

Owner · Reviewers + volunteers
Duration · Lived-with, weeks not hours
Signal · “did I keep using it?”

Batteries catch raw capability. Daily use catches everything else: friction, fatigue, the small UX papercuts that make a tool quietly get ignored by week two.

Each tool gets plugged into a real workflow — typically a live team's Slack, Notion, or codebase — and left there for two weeks. The reviewer journals every session. At the end, the team answers one question: did you keep using it voluntarily?

04

Human graders. Two per tool.

Owner · Reviewers
Process · Independent, blind where possible
Disagreement rate · <8% after calibration

Every prompt output is graded independently by two reviewers per tool on a 1–10 rubric. An AI panel (Claude alongside peer models, same rubric) runs in parallel as a calibration check — not as a vote.

If the two reviewers disagree by 2+ points, a third breaks the tie and we flag the prompt for battery review. We publish the full disagreement rate with every scorecard. Calibration is an editorial KPI.
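The tie-break rule above can be sketched as a small function. This is an illustrative reading of the stated policy, not Review.AI's actual code:

```python
def resolve(grade_1: int, grade_2: int) -> dict:
    """Two-grader rule: a gap of 2+ points calls in a third grader
    and flags the prompt for battery review (illustrative logic)."""
    gap = abs(grade_1 - grade_2)
    if gap >= 2:
        return {"score": None, "needs_third_grader": True, "flag_prompt": True}
    return {
        "score": (grade_1 + grade_2) / 2,
        "needs_third_grader": False,
        "flag_prompt": False,
    }

# resolve(9, 9) settles at 9.0; resolve(9, 6) escalates to a third grader.
```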

05

Cross-check against the web.

Owner · Signals pipeline
Sources · Reddit, X, Product Hunt, open web
Cache · 3–7 days

In parallel with the lab work, our signals pipeline pulls fresh discussion about each tool across the open web. Claude structures the raw text into sentiment, top strengths, most-complained-about issues, and pricing moves.

Live signals don't move the RAI score — they're shown next to it. If the web wildly disagrees with our verdict for more than a week, the editorial team triggers a re-review.

The signals pipeline
01 Collect · Parallel fetches: Reddit · X / Twitter · Product Hunt · vendor site · HN · ~8s
02 Dedupe · Near-duplicate filtering, spam/astroturf heuristics, engagement weighting · ~2s
03 Structure · Claude Sonnet → JSON: sentiment, strengths, complaints, pricing moves · ~4s
04 Cache · Postgres + pgvector, 3–7 day TTL, invalidated on major vendor update
05 Surface · Rendered beside the lab verdict, never folded into the RAI score
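The cache step's freshness rule can be sketched as follows. The 7-day TTL is an assumption (the upper bound of the stated 3–7 day window), and the function is illustrative, not the pipeline's real implementation:

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(days=7)  # assumed upper bound of the 3-7 day window

def is_fresh(cached_at: datetime, now: datetime,
             vendor_updated: bool = False) -> bool:
    """A cached signals bundle is served only if it is inside the TTL
    and no major vendor update has invalidated it (illustrative logic)."""
    if vendor_updated:
        return False
    return now - cached_at <= TTL

now = datetime(2026, 4, 12, tzinfo=timezone.utc)
# A 2-day-old bundle is served; a 9-day-old one triggers a re-fetch.
```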
The weights, visualized
PERF 25 · EASE 20 · LEARN 15 · INNOV 15 · VALUE 15 · REL 10

The six dimensions aren't equal, because in daily use they aren't equal. Performance carries the most weight, followed by the friction dimensions (ease, learning curve) that decide whether a tool sticks.


06

The RAI Framework, in full.

Version · v2.4
Scale · 0–100 per dimension
Weights · sum to 100%

Six dimensions. The weights below are our defaults — what we think matters most for a general reader. On the browse page, you can slide the weights to your own use case and the catalog re-ranks live.

Performance
Dimension 01 of 06
How well the tool actually does the thing. Raw capability on the prompt battery + real-workflow outcomes.
Measured by avg battery score · pass rate on stress prompts · senior-grader confidence
25%
Ease of use
Dimension 02 of 06
Interface, onboarding, friction per task. How much work you have to do to get the tool to do its work.
Measured by time-to-first-useful-output · clicks per task · UX complaint count
20%
Learning curve
Dimension 03 of 06
How long until a new user is productive. Weighted separately from ease — a tool can be easy day one and plateau, or hard day one and keep rewarding you.
Measured by day-1 vs day-14 productivity delta · docs quality · first-week abandonment
15%
Innovation
Dimension 04 of 06
Does the tool do something genuinely new, or is it a better wrapper around a commodity model? Rewards capability that shifts what's possible in the category.
Measured by capability unique to this tool · category-shift evidence · reviewer novelty score
15%
Value for money
Dimension 05 of 06
Quality per dollar. Free plans count. Token economics count. Rewards fair pricing even at the top end.
Measured by quality index ÷ monthly cost for a realistic usage profile
15%
Reliability
Dimension 06 of 06
Uptime, consistency of output, rate-limit predictability. The boring dimension that ruins everything when it's bad.
Measured by uptime over a rolling window · output variance at temp 0 · rate-limit incident count
10%
07

The scoring math.

Transparency · Weights + worked example
Normalisation · 0–100 per dimension

The RAI score is a weighted sum. No magic, no secret sauce, no vendor-specific bumps. Sliders on the browse page recompute it live with your own weights.

Worked example · one tool, coding category
Performance        94 × 0.25 = 23.50
Ease of use        93 × 0.20 = 18.60
Learning curve     91 × 0.15 = 13.65
Innovation         90 × 0.15 = 13.50
Value for money    84 × 0.15 = 12.60
Reliability        93 × 0.10 =  9.30
RAI overall                   = 91.15 → 91
# Each dimension blends the prompt battery and the real-workflow trial.
# Reviewer scores drive the mean; AI panel logged alongside for calibration.
Reweight any dimension on browse and this total recomputes live. If a dimension can't be measured yet (e.g. a tool that's only weeks old has no reliability window), we publish “n/a” and reweight the remaining dimensions proportionally — surfaced explicitly on the scorecard.
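The weighted sum, including the proportional reweighting of "n/a" dimensions described above, can be sketched in a few lines. Dimension keys and the function name are illustrative; the weights and scores are the published defaults and the worked example:

```python
WEIGHTS = {
    "performance": 0.25, "ease": 0.20, "learning": 0.15,
    "innovation": 0.15, "value": 0.15, "reliability": 0.10,
}

def rai_score(scores: dict, weights: dict = WEIGHTS) -> int:
    """Weighted sum of 0-100 dimension scores. Dimensions scored None
    ("n/a") are dropped and the remaining weights are rescaled
    proportionally, per the stated policy (illustrative sketch)."""
    live = {d: s for d, s in scores.items() if s is not None}
    total_w = sum(weights[d] for d in live)
    return round(sum(s * weights[d] / total_w for d, s in live.items()))

tool = {"performance": 94, "ease": 93, "learning": 91,
        "innovation": 90, "value": 84, "reliability": 93}
# rai_score(tool) → 91, matching the worked example above.
# Passing a different weights dict is exactly what the browse-page
# sliders do; setting a dimension to None reweights the rest.
```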
Grader agreement · Q1 2026
Two humans, 93.2% within 1 point.
target ≥ 92%
[Chart: weekly distribution of absolute grader deltas (0 / 1 / 2 / 3+ points), Jan W1 through Mar W2]
72 graded prompts/week avg · 2 humans
08

Publish. Then keep publishing.

Owner · Editor-in-chief
Review cadence · Every 90 days or on major update
Changelog · Per-tool, public

A review is never “done.” We re-run the battery on every major vendor release, and at minimum every 90 days. Scores move. Verdicts get rewritten. The per-tool changelog records every edit, with reasons.

The boundaries

Four things we will not do, ever.

✕

Take vendor money to move a score.

Not for a launch discount, not for “early access”, not for a sponsored section. Every attempt we've received is documented in our public refusal log.

Why it matters · If the score can be bought, the whole framework is a marketing brochure.
✕

Publish a review we haven't hand-tested.

Even if a tool is famous, even if a vendor is screaming. If our reviewer hasn't actually lived with the tool, it doesn't enter the catalog — it sits in “awaiting test”.

Why it matters · Reviews from people who haven't used the thing are the reason the internet can't tell you which AI tool to pick.
✕

Let AI decide the score.

Claude grades alongside humans for calibration, but the published score is the mean of two human graders. Humans can be wrong. An LLM grading other LLMs is a feedback loop we won't take.

Why it matters · If a model tests a model, the only thing you've tested is the tester.
✕

Silently rewrite a verdict.

When a score moves, the old score stays on the page — crossed out, with the date, the reason, and the diff. You can see every change we've ever made to every review.

Why it matters · A review that quietly updates is a review nobody can trust twice.
Methodology changelog

Every edit to the method, on the record.

Apr 12, 2026 · v2.4 · Battery v2.4: retired 3 prompts all tools now ace; added 4 new stress prompts on long-context refactors. Affected categories: Coding. No scores re-ranked meaningfully.
Feb 28, 2026 · v2.3 · Added Reliability as a 6th dimension (10% weight); reduced Performance from 30% to 25%. Triggered by 4 production incidents across 3 tools in Q1. Full post-mortem published.
Jan 14, 2026 · v2.2 · Web-signals pipeline separated from the RAI score. Previously signals nudged the score by ±2; now they're shown alongside but don't move the verdict.
Dec 02, 2025 · v2.1 · Two-grader minimum enforced; AI grader moved out of the mean. Based on calibration data showing the AI grader systematically over-rewarded verbose outputs.
Oct 20, 2025 · v2.0 · Real-workflow score introduced (30% of dimension score). Previously 100% battery; too many tools aced the battery and flopped in daily use.
Sep 03, 2025 · v1.0 · Framework v1.0 published. Five dimensions, single-grader, single-week trials. The version we later learned a lot of hard lessons from.

Disagree with a verdict? We want the argument.

Open the test log. Rerun the battery. Send us the diff. If you can show us where we got it wrong, we'll re-run, re-score, and credit you on the changelog.

Browse the catalog → · methodology@reviewai.in
Review.AI
© 2026 · reviewai.in · The trusted layer for AI tools
Incubated at NSRCEL, IIM Bangalore