What are AI benchmarks?
AI benchmarks are standardised tests that measure model performance on defined tasks. MMLU tests general knowledge across 57 subjects. HumanEval tests Python code generation from function signatures and docstrings. MATH tests competition-level mathematical reasoning. Each produces a score that enables direct comparison across models.
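The comparison these scores enable can be sketched in a few lines. The scores and model names below are invented for illustration; they are not real published results.

```python
# Hypothetical scores for illustration only -- not real published results.
scores = {
    "model-a": {"MMLU": 0.86, "HumanEval": 0.71, "MATH": 0.52},
    "model-b": {"MMLU": 0.82, "HumanEval": 0.78, "MATH": 0.48},
}

def best_on(benchmark: str) -> str:
    """Return the model with the highest score on one benchmark."""
    return max(scores, key=lambda m: scores[m][benchmark])

print(best_on("MMLU"))       # model-a
print(best_on("HumanEval"))  # model-b
```

Note that the "best" model differs by benchmark, which is exactly why a single aggregate ranking loses information.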
The case for benchmarks: they are reproducible, comparable, and version-controlled. The same test applied to different models gives a direct signal about relative capability. Without benchmarks, model comparison would be purely anecdotal.
The case against: benchmarks can be gamed. Models can be trained on benchmark data (contamination). High scores on HumanEval do not guarantee good real-world code. And as models approach ceiling scores, benchmarks lose discriminative power.
Why it matters
sourc.dev publishes benchmark scores as verified attributes — not as rankings. The scores are data points, not verdicts. A high MMLU score is useful context. It is not a recommendation. The methodology page at /methodology documents how each benchmark is sourced and verified.
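A score published "as a verified attribute rather than a ranking" implies a record that carries provenance alongside the number. The field names below are illustrative, not sourc.dev's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkAttribute:
    # Illustrative shape of a verified benchmark attribute;
    # field names are assumptions, not sourc.dev's real schema.
    benchmark: str    # e.g. "MMLU"
    score: float      # the score as published, unmodified
    source_url: str   # where the score was published
    verified: bool    # whether the source was checked per the methodology

attr = BenchmarkAttribute(
    benchmark="MMLU",
    score=0.86,
    source_url="https://example.com/report",  # placeholder URL
    verified=True,
)
print(attr.benchmark, attr.score)  # MMLU 0.86
```

Keeping the source and verification status attached to the score is what makes it a data point rather than a verdict: a reader can trace every number back to its origin.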