Review.AI
How we review · Methodology v2.4 · Apr 2026

We test every tool the same way, so you can trust the comparison.

Testing · human hand, not a scraper
Prompts · drafted from real use cases
Trials · lived-with, not demoed
Graders · humans + AI panel
Verdict · one clear call, not a listicle
A score is only as good as the method behind it. This page is that method — every step, every weight, everything we won't do. If you disagree with a verdict, this is where to start the argument.
Contents
01 › Category scoping
02 › Prompt battery
03 › Real workflows
04 › Human grading
05 › Web signals
06 › The RAI Framework
07 › Scoring math
08 › Publishing
01

Scope the category.

Owner · Editorial
Cadence · Quarterly
Output · Category brief + prompt battery

Before any tool is tested, the editors write a category brief: what the tools in this category are actually trying to do, what a good outcome looks like, and what the category edge cases are. The AI coding assistant brief looks nothing like the voice AI brief — and that's the point.

The brief defines the prompt battery: real-world use cases the graders have actually hit in their own work — engineering, writing, research. Every task is drafted from something someone needed to do for real, then tightened to stress a specific dimension. We retire prompts as tools start to ace them and add new ones as the category moves.
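As a sketch, one battery entry can be modeled as a small record. The field names here (`prompt_id`, `tests`, `pass_criteria`, `retired`) are illustrative, mirroring the sample rows in this section rather than Review.AI's internal schema:

```python
from dataclasses import dataclass

@dataclass
class BatteryPrompt:
    """One entry in a category's prompt battery (illustrative schema)."""
    prompt_id: str            # e.g. "cod_02"
    prompt: str               # full task text, drafted from a real use case
    tests: str                # the dimension the prompt is tightened to stress
    pass_criteria: list[str]  # every criterion must hold for a pass
    retired: bool = False     # flipped once most tools start to ace it

battery = [
    BatteryPrompt(
        "cod_02",
        "Trace and fix a race condition in this Go service.",
        "Debugging",
        ["root cause named", "fix is correct", "explains tradeoff"],
    ),
]

# Retiring a prompt is then a flag flip, not a deletion — the log history stays.
active = [p for p in battery if not p.retired]
```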

02

Run the prompt battery.

Owner · Reviewers
Per tool · Run twice, independently
Determinism · temp, seed, prompt fixed

Every tool in a category faces the same prompt set. Same wording. Same seed where supported. Two reviewers run them independently to catch flukes. Outputs go straight into the graded log.

ID | Prompt (abbrev.) | Tests | Pass criteria
cod_01 | Refactor a 420-line Python file into typed modules. | Reasoning | Types correct · tests pass · no new bugs
cod_02 | Trace and fix a race condition in this Go service. | Debugging | Root cause named · fix is correct · explains tradeoff
cod_07 | Migrate a schema from Postgres → PlanetScale with zero downtime. | Planning | Order of ops sound · rollback plan exists
cod_12 | Add OAuth2 to this Rails app without breaking existing sessions. | Multi-file | Edits 4+ files correctly · migration reversible
cod_18 | Summarize the test failures in this CI log and suggest next steps. | Triage | Right failures identified · actionable next step
cod_24 | Write a Terraform module that satisfies [policy.yaml] constraints. | Synthesis | All 11 constraints met · no extra resources
Sample prompts shown above. The full battery is kept internal for now — every prompt, temperature, seed, and pass criterion is logged against the same rubric for every tool. We'll open up the battery and its logs category by category as each one stabilises, so readers can inspect and reproduce what we ran.

Sample log entry — cod_02 · Claude Code
{
  "prompt_id": "cod_02",
  "tool": "claude-code-0.18",
  "temperature": 0.2,
  "seed": 42,
  "latency_ms": 8420,
  "tokens_out": 1142,
  "pass_criteria": {
    "root_cause_named": true,
    "fix_correct": true,
    "explains_tradeoff": true
  },
  "grader_1": 9, // Jordan
  "grader_2": 9, // Priya
  "grader_ai": 8
}
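A minimal sketch of how a graded log entry could be checked. The dict mirrors the sample entry above (with the reviewer-name annotations folded into comments); the function names are illustrative, not Review.AI's internal API:

```python
entry = {
    "prompt_id": "cod_02",
    "tool": "claude-code-0.18",
    "pass_criteria": {
        "root_cause_named": True,
        "fix_correct": True,
        "explains_tradeoff": True,
    },
    "grader_1": 9,   # Jordan
    "grader_2": 9,   # Priya
    "grader_ai": 8,  # calibration only — never part of the published mean
}

def passed(e: dict) -> bool:
    # A prompt passes only if every pass criterion holds.
    return all(e["pass_criteria"].values())

def human_mean(e: dict) -> float:
    # The published score is the mean of the two human graders.
    return (e["grader_1"] + e["grader_2"]) / 2
```

Keeping `grader_ai` in the log but out of `human_mean` is what "calibration check, not a vote" means in practice.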
03

Use the tool for real work.

Owner · Reviewers + volunteers
Duration · Lived-with, weeks not hours
Signal · “did I keep using it?”

Batteries catch raw capability. Daily use catches everything else: friction, fatigue, the small UX papercuts that make a tool quietly get ignored by week two.

Each tool gets plugged into a real workflow — typically a live team's Slack, Notion, or codebase — and left there for two weeks. The reviewer journals every session. At the end, the team answers one question: did you keep using it voluntarily?

04

Human graders. Two per tool.

Owner · Reviewers
Process · Independent, blind where possible
Disagreement rate · <8% after calibration

Every prompt output is graded independently by two reviewers per tool on a 1–10 rubric. An AI panel (Claude alongside peer models, same rubric) runs in parallel as a calibration check — not as a vote.

If the two reviewers disagree by 2+ points, a third breaks the tie and we flag the prompt for battery review. We publish the full disagreement rate with every scorecard. Calibration is an editorial KPI.
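The tie-break rule above can be sketched as a small function. This is an illustrative reading of the stated policy, not Review.AI's actual code:

```python
def resolve(grade_1: int, grade_2: int) -> dict:
    """Two-grader rule: a gap of 2+ points calls in a third grader
    and flags the prompt for battery review (illustrative logic)."""
    gap = abs(grade_1 - grade_2)
    if gap >= 2:
        return {"score": None, "needs_third_grader": True, "flag_prompt": True}
    return {
        "score": (grade_1 + grade_2) / 2,
        "needs_third_grader": False,
        "flag_prompt": False,
    }

# resolve(9, 9) settles at 9.0; resolve(9, 6) escalates to a third grader.
```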

05

Cross-check against the web.

Owner · Signals pipeline
Sources · Reddit, X, Product Hunt, open web
Cache · 3–7 days

In parallel with the lab work, our signals pipeline pulls fresh discussion about each tool across the open web. Claude structures the raw text into sentiment, top strengths, most-complained-about issues, and pricing moves.

Live signals don't move the RAI score — they're shown next to it. If the web wildly disagrees with our verdict for more than a week, the editorial team triggers a re-review.

The signals pipeline
01 Collect · Parallel fetches: Reddit · X / Twitter · Product Hunt · vendor site · HN · ~8s
02 Dedupe · Near-duplicate filtering, spam/astroturf heuristics, engagement weighting · ~2s
03 Structure · Claude Sonnet → JSON: sentiment, strengths, complaints, pricing moves · ~4s
04 Cache · Postgres + pgvector, 3–7 day TTL, invalidated on major vendor update
05 Surface · Rendered beside the lab verdict, never folded into the RAI score
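The cache step's freshness rule can be sketched as follows. The 7-day TTL is an assumption (the upper bound of the stated 3–7 day window), and the function is illustrative, not the pipeline's real implementation:

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(days=7)  # assumed upper bound of the 3-7 day window

def is_fresh(cached_at: datetime, now: datetime,
             vendor_updated: bool = False) -> bool:
    """A cached signals bundle is served only if it is inside the TTL
    and no major vendor update has invalidated it (illustrative logic)."""
    if vendor_updated:
        return False
    return now - cached_at <= TTL

now = datetime(2026, 4, 12, tzinfo=timezone.utc)
# A 2-day-old bundle is served; a 9-day-old one triggers a re-fetch.
```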
The weights, visualized
PERF 25 · EASE 20 · LEARN 15 · INNOV 15 · VALUE 15 · REL 10

The six dimensions aren't equal, because in daily use they aren't equal. Performance carries the most weight, followed by the friction dimensions (ease, learning curve) that decide whether a tool sticks.


06

The RAI Framework, in full.

Version · v2.4
Scale · 0–100 per dimension
Weights · sum to 100%

Six dimensions. The weights below are our defaults — what we think matters most for a general reader. On the browse page, you can slide the weights to your own use case and the catalog re-ranks live.

Performance
Dimension 01 of 06
How well the tool actually does the thing. Raw capability on the prompt battery + real-workflow outcomes.
Measured by avg battery score · pass rate on stress prompts · senior-grader confidence
25%
Ease of use
Dimension 02 of 06
Interface, onboarding, friction per task. How much work you have to do to get the tool to do its work.
Measured by time-to-first-useful-output · clicks per task · UX complaint count
20%
Learning curve
Dimension 03 of 06
How long until a new user is productive. Weighted separately from ease — a tool can be easy day one and plateau, or hard day one and keep rewarding you.
Measured by day-1 vs day-14 productivity delta · docs quality · first-week abandonment
15%
Innovation
Dimension 04 of 06
Does the tool do something genuinely new, or is it a better wrapper around a commodity model? Rewards capability that shifts what's possible in the category.
Measured by capability unique to this tool · category-shift evidence · reviewer novelty score
15%
Value for money
Dimension 05 of 06
Quality per dollar. Free plans count. Token economics count. Rewards fair pricing even at the top end.
Measured by quality index ÷ monthly cost for a realistic usage profile
15%
Reliability
Dimension 06 of 06
Uptime, consistency of output, rate-limit predictability. The boring dimension that ruins everything when it's bad.
Measured by uptime over a rolling window · output variance at temp 0 · rate-limit incident count
10%
07

The scoring math.

Transparency · Weights + worked example
Normalisation · 0–100 per dimension

The RAI score is a weighted sum. No magic, no secret sauce, no vendor-specific bumps. Sliders on the browse page recompute it live with your own weights.

Worked example · one tool, coding category
Performance        94 × 0.25 = 23.50
Ease of use        93 × 0.20 = 18.60
Learning curve     91 × 0.15 = 13.65
Innovation         90 × 0.15 = 13.50
Value for money    84 × 0.15 = 12.60
Reliability        93 × 0.10 =  9.30
RAI overall                   = 91.15 → 91
# Each dimension blends the prompt battery and the real-workflow trial.
# Reviewer scores drive the mean; AI panel logged alongside for calibration.
Reweight any dimension on browse and this total recomputes live. If a dimension can't be measured yet (e.g. a tool that's only weeks old has no reliability window), we publish “n/a” and reweight the remaining dimensions proportionally — surfaced explicitly on the scorecard.
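The weighted sum, including the proportional reweighting of "n/a" dimensions described above, can be sketched in a few lines. Dimension keys and the function name are illustrative; the weights and scores are the published defaults and the worked example:

```python
WEIGHTS = {
    "performance": 0.25, "ease": 0.20, "learning": 0.15,
    "innovation": 0.15, "value": 0.15, "reliability": 0.10,
}

def rai_score(scores: dict, weights: dict = WEIGHTS) -> int:
    """Weighted sum of 0-100 dimension scores. Dimensions scored None
    ("n/a") are dropped and the remaining weights are rescaled
    proportionally, per the stated policy (illustrative sketch)."""
    live = {d: s for d, s in scores.items() if s is not None}
    total_w = sum(weights[d] for d in live)
    return round(sum(s * weights[d] / total_w for d, s in live.items()))

tool = {"performance": 94, "ease": 93, "learning": 91,
        "innovation": 90, "value": 84, "reliability": 93}
# rai_score(tool) → 91, matching the worked example above.
# Passing a different weights dict is exactly what the browse-page
# sliders do; setting a dimension to None reweights the rest.
```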
Grader agreement · Q1 2026
Two humans, 93.2% within 1 point.
target ≥ 92%
[Chart: weekly distribution of absolute grader deltas (0 / 1 / 2 / 3+ points), Jan W1 through Mar W2]
72 graded prompts/week avg · 2 humans
08

Publish. Then keep publishing.

Owner · Editor-in-chief
Review cadence · Every 90 days or on major update
Changelog · Per-tool, public

A review is never “done.” We re-run the battery on every major vendor release, and at minimum every 90 days. Scores move. Verdicts get rewritten. The per-tool changelog records every edit, with reasons.

The boundaries

Four things we will not do, ever.

✕

Take vendor money to move a score.

Not for a launch discount, not for “early access”, not for a sponsored section. Every attempt we've received is documented in our public refusal log.

Why it matters · If the score can be bought, the whole framework is a marketing brochure.
✕

Publish a review we haven't hand-tested.

Even if a tool is famous, even if a vendor is screaming. If our reviewer hasn't actually lived with the tool, it doesn't enter the catalog — it sits in “awaiting test”.

Why it matters · Reviews from people who haven't used the thing are the reason the internet can't tell you which AI tool to pick.
✕

Let AI decide the score.

Claude grades alongside humans for calibration, but the published score is the mean of two human graders. Humans can be wrong. An LLM grading other LLMs is a feedback loop we won't take.

Why it matters · If a model tests a model, the only thing you've tested is the tester.
✕

Silently rewrite a verdict.

When a score moves, the old score stays on the page — crossed out, with the date, the reason, and the diff. You can see every change we've ever made to every review.

Why it matters · A review that quietly updates is a review nobody can trust twice.
Methodology changelog

Every edit to the method, on the record.

Apr 12, 2026 · v2.4 · Battery v2.4: retired 3 prompts all tools now ace; added 4 new stress prompts on long-context refactors. Affected categories: Coding. No scores re-ranked meaningfully.
Feb 28, 2026 · v2.3 · Added Reliability as a 6th dimension (10% weight); reduced Performance from 30% to 25%. Triggered by 4 production incidents across 3 tools in Q1. Full post-mortem published.
Jan 14, 2026 · v2.2 · Web-signals pipeline separated from the RAI score. Previously signals nudged the score by ±2; now they're shown alongside but don't move the verdict.
Dec 02, 2025 · v2.1 · Two-grader minimum enforced; AI grader moved out of the mean. Based on calibration data showing the AI grader systematically over-rewarded verbose outputs.
Oct 20, 2025 · v2.0 · Real-workflow score introduced (30% of dimension score). Previously 100% battery; too many tools aced the battery and flopped in daily use.
Sep 03, 2025 · v1.0 · Framework v1.0 published. Five dimensions, single-grader, single-week trials. The version we later learned a lot of hard lessons from.

Disagree with a verdict? We want the argument.

Open the test log. Rerun the battery. Send us the diff. If you can show us where we got it wrong, we'll re-run, re-score, and credit you on the changelog.

Browse the catalog → · methodology@reviewai.in
Review.AI
© 2026 · reviewai.in · The trusted layer for AI tools
Incubated at NSRCEL, IIM Bangalore