Tag archive

Benchmarks

Everything we’ve published under Benchmarks so far.

1 Useful Machines post on Benchmarks

Benchmarks readers are already filtering for a specific AI topic, which makes this archive a useful audience signal for sponsors and repeat readers.

2026-03-01 | By Owen Pike | 3 min read

SWE-bench Verified maxed out, and it's time to build your own private coding evals

OpenAI is moving on from SWE-bench Verified because the benchmark has degraded. It’s a harsh reminder that public leaderboards cannot replace private evaluations based on your actual codebase.

Tags: Benchmarks, SWE-bench, Coding Agents, OpenAI, Developer Tools