On June 1, 2026, Alibaba launched Qwen 3.7 Plus and reshuffled the conversation about which model to use for agentic coding. Not because it leads every benchmark (it doesn't), but because it combines three things that until now did not coexist in a single model: low price, multimodal vision, and an autonomy ceiling of 35 hours. Eleven days earlier, Qwen 3.7 Max had arrived, the text-only flagship. Two months earlier, DeepSeek V4 Pro had shaken up the market with open weights and unbeatable prices. And barely a week earlier, Anthropic had launched Claude Opus 4.8, the new king of SWE-Bench Pro. All are competing for the same space. Each excels at something different.
This comparison brings together all five contenders (the four above plus Qwen 3.6 Plus, the predecessor to 3.7 Plus) across the dimensions that matter for development teams: coding benchmarks, agentic capabilities, price, and which workload each one suits best.
Technical specifications
| Feature | Qwen 3.7 Plus | Qwen 3.6 Plus | Qwen 3.7 Max | DeepSeek V4 Pro | Claude Opus 4.8 |
|---|---|---|---|---|---|
| Release | Jun 2026 | Apr 2026 | May 2026 | Apr 2026 | May 2026 |
| Modality | Text + Image + Video | Text + Image | Text only | Text only | Text only |
| Context | 1M tokens | 1M tokens | 1M tokens | 1M tokens | 200K tokens |
| Max output | — | — | 65,536 tokens | 384,000 tokens | — |
| Parameters | API-only (proprietary) | API-only (proprietary) | API-only (proprietary) | 1.6T / 49B active (MIT) | API-only (closed) |
| Autonomy | 35h / 1000+ tool calls | ~35h | 35h / 1158 tool calls | Not specified | Not specified |
| Self-hosting | ❌ | ❌ | ❌ | ✅ (MIT) | ❌ |
Pricing per million tokens
| Model | Input | Output | Cached input | Output price vs cheapest |
|---|---|---|---|---|
| DeepSeek V4 Pro (OpenRouter) | $0.435 | $0.87 | ~$0.014 (cache hit) | 1× |
| Qwen 3.6 Plus (OpenRouter) | $0.325 | $1.95 | — | 2.24× |
| Qwen 3.7 Plus (OpenRouter) | $0.40 | $1.60 | $0.08 | 1.84× |
| Qwen 3.7 Max (OpenRouter) | $1.25 | $3.75 | $0.25 | 4.31× |
| Claude Opus 4.8 (direct) | $5.00 | $25.00 | $0.50 | 28.7× |
The strongest takeaway from this table: on output, DeepSeek V4 Pro costs about 29 times less than Opus 4.8 ($0.87 vs $25.00) while delivering competitive coding performance. Qwen 3.7 Plus lands at an interesting midpoint: cheaper than its predecessor (3.6 Plus) on output ($1.60 vs $1.95), and less than half the price of its bigger sibling Max ($1.60 vs $3.75).
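The ratio column can be reproduced from the output prices alone. The following is a minimal Python sketch, assuming the ratio is computed on output price (which matches the listed figures); the names `OUTPUT_PRICE` and `output_ratios` are illustrative and not part of any API.

```python
# Output prices in USD per million tokens, copied from the pricing table above.
OUTPUT_PRICE = {
    "DeepSeek V4 Pro": 0.87,
    "Qwen 3.6 Plus": 1.95,
    "Qwen 3.7 Plus": 1.60,
    "Qwen 3.7 Max": 3.75,
    "Claude Opus 4.8": 25.00,
}


def output_ratios(prices):
    """Return each model's output price as a multiple of the cheapest one."""
    cheapest = min(prices.values())
    return {model: round(price / cheapest, 2) for model, price in prices.items()}


ratios = output_ratios(OUTPUT_PRICE)

# Print models from cheapest to most expensive output.
for model, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{model}: {ratio}x")
```

The ranking changes if input price is used instead: in the table, Qwen 3.6 Plus ($0.325) and Qwen 3.7 Plus ($0.40) are both cheaper than DeepSeek V4 Pro ($0.435) on input.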
Coding benchmarks
| Benchmark | Qwen 3.7 Plus | Qwen 3.6 Plus | Qwen 3.7 Max | DeepSeek V4 Pro | Opus 4.8 |
|---|---|---|---|---|---|
| SWE-Bench Pro | ~60% | 56.6% | 60.6% | 59.0% | 69.2% 🏆 |
| SWE-Bench Verified | ~79%* | 78.8% | 80.4% | 80.6% | — |
| Terminal-Bench 2.0 | — | 61.6% | 69.7% 🏆 | 67.9% | — |
| LiveCodeBench | — | — | — | 93.5% 🏆 | — |
*Estimated. Qwen 3.7 Plus shares its text backbone with 3.7 Max.
What each benchmark says:
- SWE-Bench Pro: The hardest. Real multi-file bugs in open repos. Opus 4.8 dominates with 69.2%, but its output tokens cost roughly 15× more than Qwen 3.7 Plus and 29× more than DeepSeek.
- SWE-Bench Verified: More accessible tasks. DeepSeek V4 Pro (80.6%) and Qwen 3.7 Max (80.4%) are essentially tied. Qwen 3.6 Plus trails ~2 points behind.
- Terminal-Bench 2.0: Agentic shell execution. Qwen 3.7 Max (69.7%) leads, followed by DeepSeek (67.9%). Qwen 3.6 Plus (61.6%) lags behind.
- LiveCodeBench: Competitive coding. DeepSeek V4 Pro (93.5%) crushes Opus 4.7 (84.7%) and GPT-5.5 (85.3%), neither of which appears in the tables above.
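The trade-off in the SWE-Bench Pro bullet (a higher score at a much higher price) can be made explicit with a little arithmetic. This is a rough sketch that uses only the scores and output prices from the tables in this article; `MODELS` and `compare_to` are hypothetical names, and the Qwen 3.7 Plus score is the approximate figure (~60%) listed above, not a measured result.

```python
# SWE-Bench Pro scores (%) and output prices (USD per million tokens)
# taken from the tables in this article.
# The Qwen 3.7 Plus score is the approximate ~60% figure, not a measured result.
MODELS = {
    "Qwen 3.7 Plus": {"swe_pro": 60.0, "output_usd": 1.60},
    "Qwen 3.7 Max": {"swe_pro": 60.6, "output_usd": 3.75},
    "DeepSeek V4 Pro": {"swe_pro": 59.0, "output_usd": 0.87},
    "Claude Opus 4.8": {"swe_pro": 69.2, "output_usd": 25.00},
}


def compare_to(baseline, models):
    """For every other model, return (score gap in points, price multiple).

    The gap is baseline score minus the other model's score; the multiple is
    baseline output price divided by the other model's output price.
    """
    base = models[baseline]
    result = {}
    for name, data in models.items():
        if name == baseline:
            continue
        gap = round(base["swe_pro"] - data["swe_pro"], 1)
        multiple = round(base["output_usd"] / data["output_usd"], 1)
        result[name] = (gap, multiple)
    return result


comparison = compare_to("Claude Opus 4.8", MODELS)
for name, (gap, multiple) in comparison.items():
    print(f"Opus 4.8 vs {name}: +{gap} points at {multiple}x the output price")
```

Whether roughly nine or ten extra points justify a 15× to 29× price multiple depends on the workload; the sketch only puts the two numbers side by side.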
Agentic benchmarks (tool calling)
| Benchmark | Qwen 3.7 Plus* | Qwen 3.6 Plus | Qwen 3.7 Max | DeepSeek V4 Pro | Opus 4.6 Max† |
|---|---|---|---|---|---|
| BFCL-V4 (tool calling) | ~74 | 68.9 | 75.0 | 70.6 | 76.7 🏆 |
| MCP-Mark (MCP tools) | — | 48.2 | 60.8 | 57.1 | 56.7 |
| MCP-Atlas (MCP ecosystem) | 76.4 | 74.1 | 76.4 | 73.6 | 75.8 |
*Estimated. Qwen 3.7 Plus inherits the agentic backbone from 3.7 Max. †Data for Opus 4.6, not 4.8 (4.8 data not available for BFCL/MCP).
Qwen 3.7 Max leads in tool calling among Chinese models. DeepSeek V4 Pro is a few points behind. Qwen 3.7 Plus, sharing the same agentic stack, should be close to Max. On MCP-Atlas, Plus and Max are listed with exactly the same score (76.4), which is consistent with an identical agentic backbone, though the Plus column is marked as estimated and so is not independent confirmation.
Multimodal capabilities
| Capability | Qwen 3.7 Plus | Qwen 3.6 Plus | Qwen 3.7 Max | DeepSeek V4 Pro | Opus 4.8 |
|---|---|---|---|---|---|
| Vision (image) | ✅ | ✅ | ❌ | ❌ | ❌* |
| Video | ✅ | ❌ | ❌ | ❌ | ❌ |
| Computer use | ✅ (GUI navigation) | — | ❌ | ❌ | ✅ (beta) |
| Vision Arena rank | #16 🏆 | — | N/A | N/A | N/A |
*Opus 4.8 has computer use to view screens, but not native multimodal vision.
Qwen 3.7 Plus is the only model in this comparison with native image and video input at text-level pricing; its predecessor, Qwen 3.6 Plus, handles images but not video. This changes the calculus for teams working with visual references: UI mockups, bug screenshots, wireframes. DeepSeek V4 Pro and Qwen 3.7 Max cannot do this. Opus 4.8 can view screens through computer use, but that is an additional layer, not native multimodal processing.
Decision matrix
| Scenario | Recommended model | Why |
|---|---|---|
| Daily driver for coding (default) | Qwen 3.7 Plus | Price-performance-vision balance. Less than half the output price of Max ($1.60 vs $3.75), sees images, same autonomy ceiling. |
| Maximum coding performance | Opus 4.8 | SWE-Bench Pro 69.2%. For deep debugging where cost doesn't matter. |
| Tight budget, high volume | DeepSeek V4 Pro | Output at $0.87/M, open-weight, LiveCodeBench 93.5%. Best quality-price ratio for coding. |
| Extreme autonomy (24h+ pipelines) | Qwen 3.7 Plus / Max | 35h with >1000 verified tool calls. No other model has this documented. |
| Self-hosting / privacy | DeepSeek V4 Pro (MIT) or Qwen 35B-A3B | Open weights. Qwen 3.7 Plus and Max are API-only. |
| Coding with visual references | Qwen 3.7 Plus | Only one with native image and video input at a competitive price. |
| Intensive tool calling | Qwen 3.7 Max / Plus | BFCL-V4 75.0, MCP-Atlas 76.4. Lead in tool calling among non-Anthropic models. |
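For teams that route requests programmatically, the matrix can be encoded as a simple lookup. This is only a sketch: the scenario keys, the `recommend` function, and its flags are hypothetical labels invented for illustration, while the model names and capability sets come from the tables in this article.

```python
# Scenario keys are hypothetical labels; the model names come from the
# decision matrix above, in the order the matrix lists them.
RECOMMENDATIONS = {
    "default_coding": ["Qwen 3.7 Plus"],
    "max_performance": ["Claude Opus 4.8"],
    "tight_budget": ["DeepSeek V4 Pro"],
    "long_autonomy": ["Qwen 3.7 Plus", "Qwen 3.7 Max"],
    "self_hosting": ["DeepSeek V4 Pro", "Qwen 35B-A3B"],
    "visual_references": ["Qwen 3.7 Plus"],
    "tool_calling": ["Qwen 3.7 Max", "Qwen 3.7 Plus"],
}

# Capability sets taken from the multimodal table and the self-hosting row.
VISION_MODELS = {"Qwen 3.7 Plus", "Qwen 3.6 Plus"}
SELF_HOSTABLE = {"DeepSeek V4 Pro", "Qwen 35B-A3B"}


def recommend(scenario, needs_vision=False, must_self_host=False):
    """Return the matrix's candidates for a scenario, filtered by hard constraints.

    An empty list means no recommended model satisfies every constraint.
    """
    if scenario not in RECOMMENDATIONS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    candidates = RECOMMENDATIONS[scenario]
    if needs_vision:
        candidates = [m for m in candidates if m in VISION_MODELS]
    if must_self_host:
        candidates = [m for m in candidates if m in SELF_HOSTABLE]
    return candidates


print(recommend("long_autonomy", needs_vision=True))
print(recommend("tight_budget", needs_vision=True))
```

Treating vision and self-hosting as hard filters rather than scores reflects how the matrix reads: those two properties are either present or absent, whereas price and benchmark results are matters of degree.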
Verdict
There is no absolute winner, but there is a clear trend. Qwen 3.7 Plus is the most balanced model on the market today for development teams doing agentic coding. It doesn't lead any individual benchmark: Opus 4.8 wins at SWE-Bench Pro, DeepSeek V4 Pro wins on price and LiveCodeBench, and Qwen 3.7 Max wins at Terminal-Bench and tool calling. But Plus is the only one that simultaneously covers solid code performance, low price, multimodal vision, and extreme autonomy.
The era of the “best model” is over. Now it’s about choosing the right combination of attributes for each workload.
Main source: Qwen 3.7 Plus vs Qwen 3.7 Max — ofox.ai