Many SWE-bench-Passing PRs Would Not Be Merged into Main

A METR study found a gap between AI-generated code passing automated benchmarks and that code being accepted by human maintainers. Roughly half of the AI-generated PRs that passed SWE-bench's automated tests would not have been merged by repository maintainers, even after adjusting for review noise. The reasons maintainers gave for rejection included poor code quality, breakage of other code, and failures of core functionality.