Resource Automation + AI Audits + Accountability

What Makes a Good AI Benchmark?

This brief presents a novel assessment framework for evaluating the quality of AI benchmarks and scores 24 benchmarks against it.

The paper, “BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices,” develops an assessment framework covering 46 best practices across a benchmark’s life cycle, drawing on expert interviews and domain literature. The authors evaluate 24 AI benchmarks against this framework—16 foundation model (FM) benchmarks and 8 non-FM benchmarks—and note quality differences between the two types. Looking forward, they propose a minimum quality assurance checklist to support benchmark developers seeking to adopt best practices.