SWE-bench Verified hit its ceiling. That is useful information.
OpenAI says it has stopped reporting SWE-bench Verified for frontier coding models. Builders should read that less as drama and more as a reminder: benchmark confidence expires.
Tag archive
Everything we’ve published under Benchmarks so far.