Scope the category.
Before any tool is tested, the editors write a category brief: what the tools in this category are actually trying to do, what a good outcome looks like, and where the category's edge cases are. The AI coding assistant brief looks nothing like the voice AI brief, and that's the point.
The brief defines the prompt battery: real-world use cases the graders have actually hit in their own work in engineering, writing, and research. Every task is drafted from something someone needed to do for real, then tightened to stress a specific dimension. We retire prompts as tools start to ace them and add new ones as the category moves.