Qwen benchmarks: how the family scores publicly

Public benchmarks give a useful first-pass signal on Qwen model quality — but they age fast and measure narrow capabilities. This reference covers the major evaluation suites, what each measures, and how to read the numbers without overweighting them.

Compact overview

Qwen flagship models consistently place in the top tier of open-weight models on MMLU, HumanEval, and GSM8K. Benchmark scores age in months — a leading score today may be mid-table in six months. The LMSYS Chatbot Arena gives a human-preference signal that complements static benchmarks. Always run your own prompt sample before committing to a generation for production use.

The benchmark landscape for open-weight LLMs

A map of the major public evaluation suites used to score Qwen and comparable open-weight models — what each covers and why none of them fully captures real-world deployment quality.

Benchmarks for large language models have proliferated quickly. At the time of writing, there are dozens of public evaluation suites covering general knowledge, reasoning, code generation, mathematics, multilingual capability, and instruction-following quality. This abundance is useful — a model that leads on several independent benchmarks is more convincingly good than one that leads on only one — but it also creates noise. Some benchmarks are saturated (most good models score above 85%), some are gameable through clever prompt engineering, and some measure capabilities that are only weakly correlated with production usefulness.

The Qwen team publishes benchmark results alongside each major release, and the numbers are generally trustworthy at face value: the evaluation methodology is standard and reproducible. The caveats are the same for Qwen as for any open-weight model family: the scores reflect a specific evaluation run under specific conditions, and a different run under different conditions (different prompt format, different few-shot examples, different decoding settings) can move the number meaningfully. Treat published benchmark scores as order-of-magnitude signals, not precise measurements.

MMLU: general knowledge and reasoning across 57 subject areas

MMLU (Massive Multitask Language Understanding) poses multiple-choice questions across 57 subjects — the most widely cited general-capability benchmark for LLMs.

MMLU is a multiple-choice benchmark covering 57 subject areas including mathematics, sciences, humanities, law, medicine, and social sciences. It is the single most widely cited benchmark in LLM evaluation because its breadth makes a high score harder to achieve by memorising a narrow domain. A model that scores well on MMLU has demonstrated broad knowledge and reasoning capability across a genuinely wide range of topics.
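
To make the mechanics concrete, here is a minimal sketch of multiple-choice scoring in the MMLU style. The item formatting and letter-matching rule are illustrative simplifications, not the official protocol: production harnesses such as lm-evaluation-harness typically compare per-choice log-likelihoods rather than parsing a generated letter.

    # A minimal, illustrative multiple-choice scorer in the MMLU style.
    # Real harnesses usually compare per-choice log-likelihoods instead of
    # parsing generated text; this sketch shows the accuracy arithmetic.

    def format_item(question: str, choices: list[str]) -> str:
        letters = "ABCD"
        lines = [question]
        lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
        lines.append("Answer:")
        return "\n".join(lines)

    def accuracy(predicted: list[str], gold: list[str]) -> float:
        correct = sum(p.strip().upper() == g for p, g in zip(predicted, gold))
        return correct / len(gold)

    print(format_item("What is 2 + 2?", ["3", "4", "5", "6"]))
    print(accuracy(["B", "C"], ["B", "D"]))  # 0.5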

Qwen flagship text models typically score in the high-80s to low-90s percentage range on MMLU at the 72B parameter class, placing them among the top open-weight models in that tier. Smaller Qwen variants score lower — the 7B class typically lands in the mid-60s to mid-70s range depending on the generation. The gap between 7B and 72B on MMLU is large enough that teams with quality requirements above the mid-70s should evaluate whether the 7B tier is sufficient before deploying it in production.

MMLU has a known weakness: many of its questions can be answered correctly by a model that has seen the test set in its training data. Benchmark contamination is a real concern for any model trained on large internet-scale corpora, and Qwen is no exception. Scores should be interpreted as "the model demonstrates this level of general knowledge and reasoning capability" rather than "the model reliably solves novel problems in these subject areas".

HumanEval: code generation capability

HumanEval tests a model's ability to write correct Python functions from docstring specifications — the standard benchmark for code generation quality.

HumanEval consists of 164 Python programming problems where the model is given a function signature and docstring and must complete the function body. Performance is reported as pass@k: the estimated probability that at least one of k generated solutions passes the unit tests. The pass@1 score (a single generation must pass) is the most commonly cited variant and the most relevant for production code generation use cases where you want the first output to be correct.
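
The pass@k number comes from an unbiased estimator introduced alongside HumanEval (Chen et al., 2021): generate n samples per problem, count the c that pass the tests, and compute the probability that a random draw of k samples contains at least one pass. A direct transcription:

    # Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Estimate pass@k for one problem from n samples with c passing."""
        if n - c < k:
            return 1.0  # every possible size-k draw contains a passing sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(200, 37, 1))   # 0.185 (equals c/n when k=1)
    print(pass_at_k(200, 37, 10))  # ~0.88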

Qwen code-specialised variants typically score in the upper quartile of open-weight models on HumanEval at their parameter class. The code-instruct variants generally outperform same-size general text-instruct variants by 10–20 percentage points on pass@1, which is significant. If code generation is a primary use case, using the code-specialised Qwen variant rather than the general text variant is the correct choice.

HumanEval's limitation is that it covers only Python, only function-completion tasks, and only problems that can be verified by unit tests in a few seconds. It does not measure refactoring quality, documentation generation, multi-file context handling, or the ability to reason about existing code. MBPP (Mostly Basic Programming Problems) covers a similar space with a different problem set and is often reported alongside HumanEval to reduce the risk of over-fitting to one evaluation's quirks.

GSM8K and MATH: mathematical reasoning

GSM8K tests grade-school math reasoning in natural language; MATH covers competition-level problems. Together they bracket the mathematical reasoning capability of a model.

GSM8K (Grade School Math 8K) is a benchmark of 8,500 grade-school math word problems. Each problem requires several arithmetic steps and a final numerical answer. It is a test of sequential reasoning ability as much as mathematical knowledge — the problems require the model to parse a natural-language description, identify the relevant quantities, and execute a multi-step calculation correctly. Qwen flagship models typically score above 90% on GSM8K at the 72B class. The 7B class scores in the 70s to low 80s depending on generation.
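
Scoring GSM8K is largely a matter of answer extraction. GSM8K reference solutions end with a line of the form "#### 48", and a common convention — though not the only one — is to take the last number in the model's output as its final answer. A sketch:

    # Sketch of GSM8K-style answer extraction. Reference answers in GSM8K end
    # with a "#### <number>" line; taking the last number in the model output
    # is a common extraction convention, not an official requirement.
    import re

    def extract_gold(reference: str) -> str:
        return reference.split("####")[-1].strip().replace(",", "")

    def extract_prediction(output: str) -> str | None:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
        return numbers[-1] if numbers else None

    reference = "Four packs of 12 pencils is 4 * 12 = 48 pencils.\n#### 48"
    output = "Each pack has 12 pencils, so 4 packs give 4 * 12 = 48. The answer is 48."
    print(extract_prediction(output) == extract_gold(reference))  # True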

MATH is substantially harder — competition-level mathematics covering algebra, geometry, number theory, and calculus. Scores on MATH are lower for all model families, and Qwen's performance here has improved most noticeably in the third generation. The gap between Qwen generations is more pronounced on MATH than on GSM8K, which reflects the fact that deeper reasoning improvements show up at the harder end of the difficulty spectrum first.

Multilingual evaluations

Qwen's multilingual training data gives it a distinctive advantage on evaluation suites for Chinese, Arabic, and several other languages compared to Western-first open-weight models.

Multilingual evaluation suites — including C-Eval (Chinese), AGIEval, and MGSM (Multilingual Grade School Math) — consistently show Qwen's largest advantage over same-size Western open-weight models. The gap is most pronounced on Chinese-language tasks, where Qwen's pre-training data composition provides a substantial advantage. On Arabic and several Southeast Asian languages, Qwen also outperforms Llama- and Mistral-class models by notable margins, though the gap narrows compared to the Chinese-language advantage.

For teams building multilingual products, the multilingual benchmark scores are the most decision-relevant data point in the Qwen evaluation suite. A model that leads on English MMLU but underperforms on C-Eval may not be the right choice for a Chinese-language deployment. The inverse is also possible: Qwen may lead on multilingual evaluations while another model leads on English-only tasks. Match the benchmark to the deployment language before drawing conclusions.

The LMSYS Chatbot Arena and human preference rankings

The LMSYS leaderboard uses blind human preference ratings rather than automated scoring — a complement to static benchmarks that captures the subjective quality dimension.

The LMSYS Chatbot Arena is a crowd-sourced preference ranking platform where human raters compare responses from two anonymous models side-by-side and vote for the one they prefer. Because raters do not know which model they are evaluating, the resulting Elo-style rankings reflect genuine human preference rather than compliance with a specific scoring rubric. This makes the LMSYS leaderboard a useful complement to static benchmarks: it captures dimensions like response helpfulness, conversational naturalness, and appropriate detail level that automated benchmarks do not.
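
For intuition, the Elo-style mechanism can be sketched in a few lines: every head-to-head vote nudges both ratings toward the observed outcome. The K-factor below is a conventional illustrative choice, and LMSYS's published methodology has evolved beyond a simple online update (toward Bradley-Terry fitting over the full vote set), but the sketch captures the core idea.

    # Illustrative Elo update for pairwise preference votes. The K-factor of 32
    # is a textbook choice, not the arena's actual parameterisation.
    def elo_update(rating_a: float, rating_b: float, a_wins: bool,
                   k: float = 32.0) -> tuple[float, float]:
        expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
        score_a = 1.0 if a_wins else 0.0
        new_a = rating_a + k * (score_a - expected_a)
        new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return new_a, new_b

    # A 1200-rated model beats a 1250-rated model in one vote.
    print(elo_update(1200.0, 1250.0, a_wins=True))  # (~1218.3, ~1231.7)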

Qwen flagship models have consistently ranked in the top tier of open-weight models on the LMSYS leaderboard. The specific rank varies over time as new releases enter the arena, but the pattern across generations has been consistent: Qwen places in the top five open-weight models at the time of each major release. The multilingual and coding categories on the LMSYS leaderboard show Qwen's strongest relative positions.

Why benchmarks age fast — and what to do about it

A benchmark score that ranked first six months ago may be mid-table today. Understanding why scores age helps practitioners avoid anchoring to stale numbers.

Benchmark scores age for three reasons. First, new model releases regularly surpass previous top scores — the open-weight model field is moving fast enough that a leading score can become mid-table in six months. Second, the research community discovers evaluation artefacts — prompt formats or few-shot examples that inflate scores without reflecting genuine capability improvement. Third, as models get larger and smarter, some benchmarks become saturated: if the top ten models all score above 88% on MMLU, the benchmark no longer discriminates between them.

The practical response is to treat benchmark scores as historical snapshots, not permanent rankings. Check the date alongside the score. Use multiple benchmarks to triangulate. And supplement published scores with a small sample of your own prompts — twenty to fifty representative production examples run through the candidate model is more valuable for your decision than any public benchmark number.
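
That own-prompt evaluation does not need heavy tooling. A minimal sketch, assuming an OpenAI-compatible endpoint (which servers such as vLLM and Ollama expose) — the base URL, model id, and prompts file here are placeholders for your own:

    # Minimal own-prompt evaluation loop. Assumes an OpenAI-compatible endpoint;
    # the base_url, model id, and prompts.jsonl file are placeholders.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    with open("prompts.jsonl") as f:  # one {"prompt": "..."} object per line
        prompts = [json.loads(line)["prompt"] for line in f]

    for prompt in prompts:
        response = client.chat.completions.create(
            model="qwen-candidate",  # placeholder model id
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # reduce run-to-run variance for side-by-side review
        )
        print(response.choices[0].message.content)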

For teams building formal model evaluation processes, the NIST AI Risk Management Framework provides structured guidance on evaluation methodology that goes beyond benchmark scores into reliability, safety, and fairness dimensions. Building a lightweight version of that evaluation framework — even just a checklist — before committing to a Qwen generation for production reduces the risk of a benchmark-driven decision that does not hold up under real traffic.

Major Qwen benchmark reference

Six benchmarks: what each measures and Qwen's typical score class at flagship parameter tiers
Benchmark            | What it measures                                                 | Qwen typical score class
MMLU                 | Multiple-choice general knowledge across 57 subject areas       | High-80s to low-90s at 72B flagship; mid-60s to mid-70s at 7B
HumanEval (pass@1)   | Python function completion from docstring spec                  | Upper quartile of open-weight models; code-instruct variants lead
GSM8K                | Grade-school math word problems requiring multi-step arithmetic | Above 90% at 72B; 70s–80s at 7B class depending on generation
MATH                 | Competition-level mathematics across multiple domains           | Strongest gains in third generation; top-tier open-weight at 72B
C-Eval / MGSM        | Chinese-language knowledge and multilingual math reasoning      | Leads same-size Western open-weight models by wide margin
LMSYS Chatbot Arena  | Blind human preference ratings across diverse conversations     | Consistent top-five open-weight position at flagship release time

How to interpret per-class benchmark scores for your workload

A benchmark-to-workload mapping framework — which score to check first depending on what your application actually does.

Not every benchmark is equally relevant to every workload. A team building a multilingual customer support chatbot should weight C-Eval, MGSM, and the LMSYS multilingual category more heavily than HumanEval. A team building a code review tool should weight HumanEval and MBPP more heavily than MMLU. A team building a legal document analysis tool should look specifically at the law subset of MMLU and run their own domain-specific samples, since no public benchmark precisely covers legal extraction quality.
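
One way to make that weighting explicit is a simple lookup from workload type to the benchmarks worth checking first. The categories and entries below are illustrative, not exhaustive:

    # Illustrative workload-to-benchmark priorities, following the weighting
    # described above. Extend the mapping with your own workload categories.
    BENCHMARK_PRIORITIES: dict[str, list[str]] = {
        "multilingual_support_chat": ["C-Eval", "MGSM", "LMSYS multilingual category"],
        "code_review": ["HumanEval", "MBPP"],
        "legal_document_analysis": ["MMLU law subset", "own domain-specific samples"],
    }

    def benchmarks_for(workload: str) -> list[str]:
        # Fall back to general-capability signals for unmapped workloads.
        return BENCHMARK_PRIORITIES.get(workload, ["MMLU", "LMSYS overall"])

    print(benchmarks_for("code_review"))  # ['HumanEval', 'MBPP']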

The right mental model is: use published benchmarks to narrow the candidate field to two or three models that perform well on the dimensions most relevant to your task, then run a domain-specific evaluation on that short list before making a final selection. Benchmarks are a filter, not a verdict.

One final note on benchmark interpretation: score comparisons are only valid when the evaluation conditions are the same. A Qwen score measured with 5-shot examples is not directly comparable to a score measured with 0-shot examples. A score measured with greedy decoding is not comparable to one measured with beam search. When comparing Qwen benchmark numbers to those of another model family, verify that the evaluation protocol matches — or treat the comparison as approximate rather than precise.
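
A lightweight way to enforce that discipline is to record the evaluation conditions alongside every score and refuse to compare mismatched records. The field names in this sketch are illustrative:

    # Sketch of recording evaluation conditions next to each score so that only
    # like-for-like numbers are compared. Field names are illustrative.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class EvalProtocol:
        benchmark: str
        n_shot: int          # e.g. 5 for 5-shot, 0 for 0-shot
        decoding: str        # e.g. "greedy" or "beam"
        prompt_format: str   # e.g. "harness-default" or "chat-template"

    def comparable(a: EvalProtocol, b: EvalProtocol) -> bool:
        """Scores are directly comparable only when the protocol matches."""
        return a == b

    ours = EvalProtocol("MMLU", n_shot=5, decoding="greedy", prompt_format="chat-template")
    theirs = EvalProtocol("MMLU", n_shot=0, decoding="greedy", prompt_format="chat-template")
    print(comparable(ours, theirs))  # False: 5-shot vs 0-shot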

"We stopped anchoring to public benchmark tables after the second time a model that scored well on MMLU underperformed on our actual domain prompts. Now we treat benchmarks as a shortlist filter and do our own 50-prompt evaluation before making a production decision. Qwen has consistently been in the shortlist, but the domain evaluation is where the real decision happens."
Solomon K. Wadsworth
Indie Developer · Alderwood Open Stack · Bend, OR

Frequently asked questions about Qwen benchmarks

Four questions covering benchmark scores, the LMSYS leaderboard, benchmark aging, and how to interpret per-class scores for specific workloads.

How does Qwen score on MMLU?

Qwen flagship models at the 72B parameter class typically score in the high-80s to low-90s percentage range on MMLU, placing them among the top open-weight models at that tier. Smaller Qwen variants score in the mid-60s to mid-70s range at the 7B class. Exact scores depend on the generation and evaluation methodology. Always check the benchmark date — scores from six months ago may not reflect the current generation's performance.

What benchmark measures Qwen code performance?

HumanEval and MBPP are the standard benchmarks for code generation. Qwen code-specialised variants consistently score in the upper quartile of open-weight models on HumanEval at their parameter class. The code-instruct variants outperform general text-instruct variants of the same parameter size by roughly 10–20 percentage points on pass@1, making the code-specialised choice clearly correct for code-primary workloads.

Why do Qwen benchmark scores age so quickly?

Benchmark scores age because new model releases regularly surpass previous top scores, because the research community discovers evaluation artefacts that inflate scores, and because some benchmarks become saturated as models improve. A leading Qwen score from six months ago may now be mid-table. Always check the score date, use multiple benchmarks to triangulate, and supplement published scores with a sample of your own production prompts before making a generation selection.

What is the LMSYS leaderboard and where does Qwen place?

The LMSYS Chatbot Arena is a crowd-sourced preference ranking where human raters compare anonymous model responses side-by-side. It captures subjective quality dimensions — helpfulness, conversational naturalness, appropriate detail — that automated benchmarks miss. Qwen flagship models have consistently ranked in the top five open-weight models on the LMSYS leaderboard at the time of each major release, with particular strength in multilingual and coding categories.