The Problem with Measuring Coding Agents: Inflated Benchmarks, Cheating, and What Really Matters
SWE-bench is no longer a reliable measure. Models "cheat" by reading the repository's git history, and scores achieved with heavily optimized scaffolding don't reflect real-world capability. This is a practical guide to reading coding-agent benchmarks without being fooled by the numbers.