Qwen 3.7 Max: The Most Unbalanced AI Model

Some AI models are jacks of all trades and masters of none. And then there’s Qwen 3.7 Max, which is an astonishing programmer and, according to those who’ve thoroughly tested it, one of the worst creative writers out there. The imbalance isn’t an accident; it’s a design decision.

Alibaba launched Qwen 3.7 Max on May 20, 2026 at the Alibaba Cloud Summit in Hangzhou, and made it clear from the start that this isn’t a generic conversational assistant. It’s an “agent foundation” — a base for building autonomous agents — optimized for coding, office automation, and long-duration task execution. It competes directly with GPT-5.5, Claude Opus 4.7, and Gemini 3.5 Flash. And in code, it genuinely competes.

The benchmark numbers back this up. On SWE-Bench Verified, the standard for measuring software engineering problem-solving ability, Qwen 3.7 Max scored 80.4%. To put that in context, it’s on par with Claude Opus 4.6 Max (80.8%) and DeepSeek V4 Pro Max (80.6%), though well behind GPT-5.5 (88.7%). On SWE-Pro it reached 60.6%, and on SWE-Multilingual 78.3%. On Terminal-Bench 2.0, which measures command-line skills, it scored 69.7%, the best among the models compared.

Where it truly shines is in prolonged autonomous execution. Alibaba demonstrated a continuous kernel optimization session that lasted 35 hours, during which the model made 1,158 tool calls and 432 code evaluations, achieving a 10x speedup over a baseline Triton kernel on a hardware architecture it had never seen before. That’s not an academic benchmark — it’s a demonstration of what it means to have a synthetic software engineer working double shifts without sleep.

But the Achilles’ heel is just as remarkable. YouTuber ServeNoMaster, after extensive testing, described it as “one of the best models I’ve tested on the technical side and one of the weakest I’ve tested in creative writing.” The title of his video calls it “the most unbalanced AI model.” This isn’t a hidden flaw: Alibaba designed the model for one thing (code agents and productivity) and sacrificed everything else. If you need an assistant that can also write poetry, this isn’t your model.

The price looks attractive: $2.50 per million input tokens and $7.50 per million output tokens, half of what Claude Opus 4.7 or GPT-5.5 costs ($5/$15). But here’s the catch: Qwen 3.7 Max is extremely verbose. According to Artificial Analysis, it generated 97 million output tokens during its evaluation, compared to an average of 35 million for comparable models. That is roughly 2.8 times as many, which can more than erase the per-token discount on output-heavy workloads.
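To see how verbosity eats into the discount, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures quoted above and makes two simplifying assumptions that are not from any source: the rival model emits the 35-million-token average, and input-token costs are ignored. The function and variable names are illustrative.

```python
# Figures quoted in the article; prices are USD per million output tokens.
QWEN_OUTPUT_PRICE = 7.50
RIVAL_OUTPUT_PRICE = 15.00

# Output volume in millions of tokens (Artificial Analysis figures from the article).
QWEN_OUTPUT_TOKENS_M = 97
# ASSUMPTION: the rival emits the quoted average for comparable models.
RIVAL_OUTPUT_TOKENS_M = 35


def output_cost(price_per_million: float, tokens_millions: float) -> float:
    """Return the output-token spend in USD (input tokens are ignored)."""
    return price_per_million * tokens_millions


qwen_cost = output_cost(QWEN_OUTPUT_PRICE, QWEN_OUTPUT_TOKENS_M)
rival_cost = output_cost(RIVAL_OUTPUT_PRICE, RIVAL_OUTPUT_TOKENS_M)
ratio = qwen_cost / rival_cost

print(f"Qwen 3.7 Max output spend: ${qwen_cost:.2f}")   # $727.50
print(f"Rival output spend:        ${rival_cost:.2f}")  # $525.00
print(f"Qwen / rival:              {ratio:.2f}x")       # 1.39x
```

Under these assumptions the half-price model ends up about 39% more expensive on output for the same workload. Real workloads will differ, since verbosity on a benchmark run is only a rough proxy for verbosity on your own tasks.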

That verbosity combines explosively with the prompt caching system. The model allows caching long contexts to save costs, but with rules that can work against you: cache creation costs 125% of the standard price, the cache TTL is only 5 minutes, and if you don’t correctly configure the cache_control markers, you end up paying for creation over and over. Reddit users report massive unexpected bills — one user said their $30 plan was exhausted in roughly two hours.
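The caching rules are easier to reason about with a small cost model. The sketch below is not Alibaba’s billing logic; it only encodes the two rules quoted above (creation at 125% of the standard input price, 5-minute TTL). Two things are assumptions introduced for illustration: the cache-read multiplier of 10%, which the article does not state, and the idea that each hit refreshes the TTL. The function name prefix_cost is hypothetical.

```python
INPUT_PRICE = 2.50              # USD per million input tokens (from the article)
CACHE_WRITE_MULTIPLIER = 1.25   # cache creation costs 125% (from the article)
CACHE_TTL_SECONDS = 5 * 60      # 5-minute TTL (from the article)
CACHE_READ_MULTIPLIER = 0.10    # ASSUMPTION: hypothetical cache-hit discount


def prefix_cost(request_times, prefix_tokens_m, cached=True):
    """Estimate USD spent on a shared prompt prefix across several requests.

    request_times: request timestamps in seconds.
    prefix_tokens_m: size of the shared prefix, in millions of tokens.
    ASSUMPTION: every hit refreshes the TTL.
    """
    base = INPUT_PRICE * prefix_tokens_m
    if not cached:
        return base * len(request_times)

    total = 0.0
    last_touch = None
    for t in sorted(request_times):
        if last_touch is not None and t - last_touch <= CACHE_TTL_SECONDS:
            total += base * CACHE_READ_MULTIPLIER   # cache hit
        else:
            total += base * CACHE_WRITE_MULTIPLIER  # cache (re)creation
        last_touch = t
    return total


# Ten requests sharing a 100k-token prefix.
busy = [i * 60 for i in range(10)]    # one per minute: cache stays warm
slow = [i * 600 for i in range(10)]   # one per ten minutes: cache always expired

print(f"warm cache:    ${prefix_cost(busy, 0.1):.4f}")
print(f"expired cache: ${prefix_cost(slow, 0.1):.4f}")
print(f"no caching:    ${prefix_cost(slow, 0.1, cached=False):.4f}")
```

In this model, requests spaced further apart than the TTL pay the creation surcharge every time and come out 25% more expensive than not caching at all. Misplaced cache_control markers would presumably have the same effect, because every request would create a new cache entry instead of hitting an existing one.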

One point of confusion is worth clearing up: some YouTube videos claim Qwen 3.7 Max scored 72.5 on SWE-Bench Verified, but that number doesn’t appear in any primary source. The real score, confirmed by the official Qwen blog and multiple independent sources, is 80.4%. The 72.5 is likely a garbled reference to Qwen3-Max-Instruct, an earlier model, though that model’s actual score was 69.6%.

Why It Matters

Qwen 3.7 Max is China’s strongest entry yet in the frontier model race. Its coding performance is genuinely world-class — not a “good effort considering” but competitive against the best the West has to offer. But its unbalanced profile and hidden costs are important warnings.

For developers looking for a pure coding assistant, Qwen 3.7 Max is a serious option, especially at its price. But you have to go in with open eyes: the cache needs proper configuration, the verbosity needs to be managed, and if you need anything creative, you’re better off looking elsewhere.

The most unbalanced model on the market is also, for certain use cases, the best.

Main source: Qwen3.7: The Agent Frontier — Official Alibaba/Qwen Blog