Between April and June 2026, Chinese labs released a wave of language models built for a single mission: agentic coding. GLM-5.1, Qwen3.7-Max, Kimi K2.6, MiniMax M3, MiMo-V2.5-Pro, DeepSeek V4 Pro, and DeepSeek V4 Flash compete directly with Claude Opus 4.8 on the benchmarks that matter most for autonomous software development. This comparison analyzes them one by one, with data updated to June 2026.
The context: a historic concentration of releases
In just three months, seven Chinese models have hit the market with a common denominator: all are explicitly positioned for agentic coding — the ability of a model to write, debug, and optimize code autonomously during long sessions, using tools, iterating on results, and maintaining coherence across hundreds or thousands of calls.
The Western reference point is Claude Opus 4.8, released by Anthropic on May 28, 2026, which raised the SWE-Bench Pro standard to 69.2%. But Chinese models are closing the gap at an accelerated pace — and at prices that make them hard to ignore.
What exactly does SWE-Bench Pro measure?
Before diving into the numbers, a necessary clarification. SWE-Bench Pro evaluates a model’s ability to solve real bugs in open-source repositories: the model receives a problem description, explores the codebase, identifies the root cause, and proposes a patch. It is the metric closest to what a human developer does daily. However, each model is evaluated with different scaffolds (the agent system that orchestrates the tools), so direct comparisons between labs are directional, not absolute.
The contenders, in performance order
Claude Opus 4.8 — the standard to beat
Anthropic released Opus 4.8 as a “modest but tangible” improvement over Opus 4.7, and the numbers confirm it: 69.2% on SWE-Bench Pro, up from 64.3% for its predecessor. That’s an improvement of nearly 5 percentage points in just one month. But the most interesting data point isn’t in the benchmarks: Anthropic claims Opus 4.8 is four times less likely to let bugs pass in its own code without reporting them. For teams relying on autonomous agents for legacy codebases, this is as important as any performance metric.
Pricing remains at $5 per million input tokens and $25 per million output tokens — the same as Opus 4.7. It is the most expensive model in this comparison by a wide margin.
Qwen3.7-Max — the Chinese leader in coding
Alibaba’s flagship model arrived on May 19 with performance that surprised even attentive analysts. It reaches 60.6% on SWE-Bench Pro, ahead of every other Chinese model in this comparison and of Opus 4.6 (57.3%). Where it truly shines is Terminal-Bench 2.0, where its score of 69.7% beats all competitors, including Opus 4.6 (65.4%) and DeepSeek V4 Pro (67.9%). Terminal-Bench measures real terminal tasks: package installation, process debugging, network configuration. It is the benchmark closest to a developer’s everyday command-line work.
The most impressive demonstration of Qwen3.7-Max was a 35-hour autonomous kernel optimization with over 1,000 tool calls, completed without human intervention. Alibaba also demonstrated that the model generalizes across different scaffolds (Claude Code, OpenClaw, Qwen Code) with consistent results.
Its current price is $1.25 per million input tokens and $3.75 per million output tokens, thanks to a 50% promotion valid until June 22. The normal price is double.
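To make the per-token prices concrete, here is a minimal sketch of what one long agentic session might cost. The prices come from the text above; the list price for Qwen3.7-Max is inferred from "double", and the workload of 40 million input tokens and 2 million output tokens is a purely hypothetical figure chosen for illustration.

```python
# Prices in USD per million tokens as (input, output), taken from the article.
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Qwen3.7-Max (promo)": (1.25, 3.75),
    "Qwen3.7-Max (list)": (2.50, 7.50),  # assumption: exactly double the promo price
}


def session_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Return the API cost in USD for one session with the given token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: a long session that rereads a lot of context.
for model in PRICES:
    print(f"{model}: ${session_cost(40_000_000, 2_000_000, model):.2f}")
```

Agentic sessions tend to be input-heavy because the context is resent on every call, so the input price can matter as much as the headline output price.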
Kimi K2.6 — the open-weight model closest to the frontier
Moonshot AI released Kimi K2.6 on April 20 as an open-weight model with approximately 1 trillion parameters. Its 59.5% score on SWE-Bench Pro places it slightly behind Qwen3.7-Max, but ahead of Opus 4.6 (57.3%). On SWE-Bench Verified it reaches 80.2%, essentially tied with Opus 4.6 (80.8%).
Kimi K2.6 is explicitly designed for proactive agents that operate 24/7 without human supervision. Moonshot AI reports a 96.6% tool invocation rate and a 50% improvement in web application generation with Next.js compared to its predecessor K2.5. Integrators such as CodeBuddy and Augment Code confirm that the model is especially adept at intelligently pivoting when an initial approach fails.
Its price is $0.75 per million input tokens and $3.50 per million output tokens. It is available with open weights under a Modified MIT license.
DeepSeek V4 Pro — the king of value for money
DeepSeek V4 Pro, released on April 24, is a 1.6 trillion parameter model (49 billion active) with 1 million token context and an MIT license. Its 80.6% score on SWE-Bench Verified is just 0.2 points shy of Opus 4.6 (80.8%), and its 93.5 on LiveCodeBench is the highest of any model to date.
But the data point that has most shaken the market is its price. DeepSeek applied a 75% discount that was later made permanent, bringing the output cost to $0.87 per million tokens, compared to $25 for Opus 4.8. That’s roughly 29 times cheaper on output (25 / 0.87 ≈ 28.7) for SWE-Bench Verified performance within 0.2 points of Opus 4.6. Its hybrid CSA+HCA architecture reduces FLOPs to 27% and KV cache to 10% of what the previous generation required.
MiniMax M3 — the newcomer promising to revolutionize
The most recent release in this comparison — June 1, 2026 — is also one of the most ambitious. MiniMax M3 is the first open-weight model to combine frontier coding, 1 million token context, and multimodal capability (text, image, and video) in a single system.
Its figures: 59.0% on SWE-Bench Pro, 83.5 on BrowseComp (surpassing Opus 4.7, which scored 79.3), and 66.0% on Terminal-Bench 2.1. MiniMax claims it outperforms GPT-5.5 and Gemini 3.1 Pro in coding, though independent validation is still underway due to the recency of the release.
M3’s true differentiator is its MSA (MiniMax Sparse Attention) architecture, which replaces full attention with block KV selection. This makes the 1 million token context practical: prefill is 9x faster, decoding is 15x faster, and computation per token is reduced to one-tenth compared to the previous generation.
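The general idea of block KV selection can be sketched in a few lines of plain Python. This is a toy illustration and not MiniMax’s implementation: the block scoring rule (dot product with the block’s mean key), the function name `block_sparse_attention`, and the parameters `block_size` and `top_k` are all assumptions made for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend only to the top_k key/value blocks that best match the query."""
    # Split the sequence positions into contiguous blocks.
    blocks = [range(i, min(i + block_size, len(keys)))
              for i in range(0, len(keys), block_size)]

    # Assumed scoring rule: dot product of the query with the block's mean key.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(query))]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(i for b in ranked[:top_k] for i in blocks[b])

    # Ordinary scaled dot-product attention, restricted to the selected positions.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in selected])
    output = [sum(w * values[i][d] for w, i in zip(weights, selected))
              for d in range(len(values[0]))]
    return output, selected


# Two blocks of four positions; the query matches the keys in the second block.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
values = [[float(i)] for i in range(8)]
output, selected = block_sparse_attention([0.0, 1.0], keys, values,
                                          block_size=4, top_k=1)
print(selected, output)
```

With `top_k` fixed, the attention step touches a constant number of positions regardless of sequence length, which is the property that makes very long contexts affordable. This sketch recomputes the block summaries on every call for simplicity; a production system would presumably cache them.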
At promotional pricing, MiniMax M3 costs $0.30 per million input tokens and $1.20 per million output tokens. On output that is about 21 times cheaper than Opus 4.8 (roughly 17 times on input), and being open-weight, it allows self-hosting.
GLM-5.1 — the multi-iteration optimization specialist
The model from Z.ai (formerly Zhipu AI), released on April 7, was trained entirely on Huawei Ascend chips, making it a symbol of Chinese technological sovereignty. Its 58.4% score on SWE-Bench Pro trails the leaders, but it has a unique quality: it is designed not to plateau.
GLM-5.1’s most telling demonstration is a vector database optimization task in Rust. In a normal 50-turn session, it achieved around 3,500 queries per second, comparable to Opus 4.6. But in an optimization loop of 600 iterations with over 6,000 tool calls, it reached 21,500 queries per second, roughly six times more. While other models stagnate after the first few iterations, GLM-5.1 continues to find structural improvements.
Its price is approximately $0.98 per million input tokens and $3.08 per million output tokens on OpenRouter. The context is limited to 203,000 tokens, significantly less than the 1 million offered by competitors.
MiMo-V2.5-Pro — the compiler builder
Xiaomi entered the language model market with MiMo-V2.5-Pro, a 1.02 trillion parameter model (42 billion active) with an MIT license and 1 million token context. Its benchmark scores are modest (57.2% on SWE-Bench Pro, 78.9% on Verified), but its strength lies elsewhere.
Xiaomi demonstrated that MiMo-V2.5-Pro built a complete SysY compiler in Rust — a project that takes a computer science student weeks — in 4.3 hours with 672 tool calls, achieving a perfect score of 233/233 on the test suite. It is the ideal model for infrastructure tasks requiring long, autonomous sessions.
Its price is $0.435 per million input tokens and $0.87 per million output tokens, which ties it with DeepSeek V4 Pro as the second cheapest on output, behind only V4 Flash. Its generation speed is low (42 tokens per second) and it tends to be verbose, but for tasks that prioritize correctness over speed, it is a solid choice.
DeepSeek V4 Flash — the ultra-economical
If DeepSeek V4 Pro revolutionized the value-for-money equation, V4 Flash completely redefined it. With 284 billion total parameters (13 billion active) and an output price of $0.28 per million tokens, it delivers 79.0% on SWE-Bench Verified. That’s just 1.6 percentage points less than V4 Pro, for roughly one-third the price.
To put it in perspective: V4 Flash costs approximately 90 times less than Claude Opus 4.8 in output tokens, with a coding performance gap that many teams would consider acceptable. For startups, small teams, or tasks that require processing millions of tokens without worrying about cost, V4 Flash is arguably the best price-to-performance model ever released.
Like V4 Pro, it has an MIT license, open weights, and 1 million token context.
Benchmark comparison table
| Model | SWE-Bench Pro | SWE-Bench Verified | Terminal-Bench 2.0 | Output price/1M |
|---|---|---|---|---|
| Claude Opus 4.8 | 69.2% 🏆 | 80.8%† | 65.4%† | $25.00 |
| Qwen3.7-Max | 60.6% | 80.4% | 69.7% 🏆 | $3.75 (promo) |
| Kimi K2.6 | 59.5% | 80.2% | 66.7% | $3.50 |
| MiniMax M3 | 59.0% | — | 66.0% | $1.20 (promo) |
| DeepSeek V4 Pro | 59.0% | 80.6% | 67.9% | $0.87 💸 |
| GLM-5.1 | 58.4% | — | 63.5% | $4.40 |
| MiMo-V2.5-Pro | 57.2% | 78.9% | 68.4% | $0.87 |
| DeepSeek V4 Flash | — | 79.0% | — | $0.28 💸 |
Note: figures come from each lab’s official reports and may use different methodologies. † = data from Opus 4.6, not 4.8. 💸 = permanent reduced price.
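The "times cheaper" figures quoted throughout this comparison can be reproduced from the output prices in the table. The sketch below simply divides the Opus 4.8 output price by each model’s output price; the prices are copied from the table as-is (including promotional ones), and the function name `times_cheaper` is only illustrative.

```python
# Output prices in USD per million tokens, copied from the table above.
OUTPUT_PRICE = {
    "Claude Opus 4.8": 25.00,
    "Qwen3.7-Max": 3.75,
    "Kimi K2.6": 3.50,
    "MiniMax M3": 1.20,
    "DeepSeek V4 Pro": 0.87,
    "GLM-5.1": 4.40,
    "MiMo-V2.5-Pro": 0.87,
    "DeepSeek V4 Flash": 0.28,
}


def times_cheaper(model: str, baseline: str = "Claude Opus 4.8") -> float:
    """Return how many times cheaper `model` is than `baseline` on output tokens."""
    return OUTPUT_PRICE[baseline] / OUTPUT_PRICE[model]


# List the models from cheapest to most expensive output price.
for model in sorted(OUTPUT_PRICE, key=OUTPUT_PRICE.get):
    print(f"{model:<18} {times_cheaper(model):5.1f}x")
```

These ratios cover output tokens only. Input prices differ by smaller factors, so the real saving on an input-heavy agentic workload will be lower than the output ratio suggests.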
Which model to choose by use case
For daily coding (PRs, fixes, features): DeepSeek V4 Pro offers the best balance between performance (80.6% on SWE-Bench Verified) and price ($0.87 per million output tokens). If the budget is minimal, V4 Flash at $0.28 per million output tokens still reaches 79.0% on SWE-Bench Verified.
For infrastructure and long-duration tasks: MiMo-V2.5-Pro demonstrated it can complete complex projects like a compiler autonomously in hours. GLM-5.1 is the alternative if the task requires sustained iterative optimization.
For 24/7 autonomous agents: Kimi K2.6 is explicitly designed for this use case, with a 96.6% tool invocation rate and the ability to orchestrate heterogeneous agents.
For autonomous web navigation: MiniMax M3 leads with 83.5 on BrowseComp, surpassing Opus 4.7. Its 1M context at minimal pricing makes it ideal for tasks requiring reading and processing large volumes of web information.
For high-trust work on legacy codebases: Claude Opus 4.8 remains the safest choice. Its honesty — four times less likely to ignore bugs in its own code — and its mature ecosystem (Claude Code, MCP, refined tool use) justify the premium price when the cost of error exceeds the API cost.
The Chinese price war intensifies
Beyond the benchmarks, there is a trend worth noting: Chinese models are competing not only on performance but also on aggressively low prices. DeepSeek made its 75% discount on V4 Pro permanent in May. Xiaomi entered the API market with prices that undercut almost everyone. Alibaba offers 50% off Qwen3.7-Max. MiniMax M3 launched with promotional prices that are a fraction of those of Western leaders.
The result is a market where frontier-performing models are accessible for less than $1 per million output tokens. A year ago, that seemed impossible.
What Claude Opus 4.8 still does better
Despite the narrowing gap, Claude Opus 4.8 maintains qualitative advantages that benchmarks do not fully capture. Honesty in coding — reporting bugs rather than ignoring them — is a significant improvement for autonomous development. Claude Code’s dynamic workflows allow running parallel agents for codebase-scale work. And Anthropic’s ecosystem, with MCP and refined tool use, remains more mature than the Chinese alternatives.
For enterprises where the cost of a production error far exceeds the API cost, Opus 4.8 remains the right choice. For everyone else, Chinese models offer an increasingly hard-to-ignore alternative.
Main sources: GLM-5.1 — Z.ai | Qwen3.7 — Alibaba | Kimi K2.6 — Moonshot AI | MiniMax M3 — MiniMax | MiMo-V2.5-Pro — Xiaomi | DeepSeek V4 — DeepSeek | Claude Opus 4.8 — Anthropic