Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Chongyang Gao1, Diji Yang2, Shuyan Zhou3, Xichen Yan4, Luchuan Song5, Shuo Li6, Kezhen Chen6
1Northwestern University, 2UC Santa Cruz, 3Duke University, 4University of Birmingham, 5University of Rochester, 6Analogy AI, Inc.
cygao@u.northwestern.edu, kezhenchen@analogyai.org

We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-Pro-Preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-Flash-Preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation.

We hope CFE will serve as a realistic and diagnostic testbed for future models, training objectives, and inference strategies that emphasize verifiable intermediate supervision, robust late-stage composition, and efficient reasoning.

We are sincerely grateful to the following course instructors and professors—Eric Carlson, Yidong Chong, Jens Jensen, Kevin Zhou, Brian Naranjo, and Ravishankar Sundararaman—for explicitly approving and authorizing the use and release of their course materials for CFE, and for verifying the corresponding problems and reference solutions.

CFE Benchmark Statistics

CFE-Bench is a cross-disciplinary STEM reasoning benchmark consisting of a text-only split (305 questions) and a multimodal split (144 questions). The text-only subset primarily covers Physics and Mathematics, with additional problems in Economics, Electrical Engineering, and Computer Science. The multimodal subset emphasizes Physics and multiple engineering domains, including Electrical and Mechanical Engineering, alongside a diverse tail of other STEM fields.

Variable-Based Evaluation

Evaluating STEM solutions is notoriously brittle due to algebraic variations and verbose reasoning. To reduce false positives from directly comparing long-form model responses with full reference solutions, we introduce variable based evaluation, denoted as Short-to-Short (S2S).

Setting Acc (%) AUC P R F1 FN FP
S2S 98.03 ± 0.77 0.98 ± 0.01 0.98 ± 0.01 0.98 ± 0.01 0.98 ± 0.01 4.75 ± 1.79 1.25 ± 0.83
L2S 96.72 ± 0.40 0.96 ± 0.01 0.97 ± 0.00 0.97 ± 0.00 0.97 ± 0.00 1.50 ± 0.50 8.50 ± 1.50
L2L 89.67 ± 0.75 0.89 ± 0.01 0.90 ± 0.01 0.90 ± 0.01 0.90 ± 0.01 11.00 ± 2.55 20.50 ± 1.12

Table 1: Performance summary across evaluation settings. S2S achieves the highest accuracy and the lowest False Positive (FP) rate.

Our S2S framework anchors evaluation to expert-annotated ground-truth variables: \( V_{\mathrm{gt}}=\{(v_i,d_i,x_i,t_i)\} \). By extracting specific predicted values \( \hat{x}_i \) and comparing them against these targets, we eliminate the ambiguity of holistic narrative matching. As shown above, S2S yields the strongest overall performance and substantially reduces false positives relative to L2L, providing a more conservative and discriminative measure of model capability.

Model Performance on CFE-Bench

We evaluate a broad set of state-of-the-art models using our variable-based protocol with two metrics: Variable Accuracy (average fraction of correct variables per question) and Question Accuracy (all variables must be correct). Across both subsets, Question Accuracy is substantially lower, revealing that models often solve some components while failing others.

Text + Multimodal (449)
Model Variable Acc. Question Acc.
Open Llama-4-Maverick 21.82% 16.48%
Qwen3.5-397B 53.60% 47.44%
Prop. Qwen3.5-plus 53.60% 47.00%
Claude-Opus-4.6 54.15% 47.66%
Grok-4-0709 47.78% 41.65%
Grok-4.1-fast-reasoning 44.18% 38.08%
GPT-5.2 55.80% 48.78%
Gemini-3-Flash-Preview 62.90% 55.46%
Gemini-3-Pro-Preview 62.70% 55.01%
Gemini-3.1-Pro-Preview 66.04% 59.69%

Open-weights   Proprietary  |  Bold underline = best in group, underline = second-best in group.

Text Subset (305)
Model Variable Acc. Question Acc.
Open Gemma-3-27B-it 13.79% 9.84%
Ministral-3-14B-Reasoning 17.92% 13.11%
Llama-4-Maverick 24.53% 19.67%
GPT-oss-120b 41.15% 34.43%
Qwen3-235B-Instruct 37.41% 32.46%
Qwen3-235B-Thinking 39.31% 32.79%
Qwen3.5-397B 54.12% 48.52%
MiniMax-M2.1 33.44% 27.54%
MiniMax-M2.5 34.46% 28.52%
Kimi-K2-Instruct 24.62% 19.02%
Kimi-K2-Thinking 46.28% 39.02%
Kimi-K2.5 51.32% 43.93%
GLM-4.7 44.79% 39.02%
GLM-5 47.24% 41.64%
DeepSeek V3.2 (chat) 48.08% 41.64%
DeepSeek V3.2 (reasoner) 50.07% 43.28%
Prop. Qwen3.5-plus 54.43% 48.20%
Claude-Sonnet-4.5 36.74% 29.51%
Claude-Opus-4.5 49.03% 41.97%
Claude-Opus-4.6 58.95% 52.79%
Grok-4-0709 53.24% 47.54%
Grok-4.1-fast-reasoning 49.58% 43.61%
GPT-5.2 57.99% 51.15%
Gemini-3-Flash-Preview 66.02% 58.69%
Gemini-3-Pro-Preview 65.29% 58.03%
Gemini-3.1-Pro-Preview 70.66% 64.92%
Multimodal Subset (144)
Model Variable Acc. Question Acc.
Open Gemma-3-27B-it 6.83% 2.78%
Llama-4-Maverick 16.09% 9.72%
InternVL3-78B-Instruct 6.52% 2.78%
InternVL3.5-GPT-OSS-20B 3.81% 2.08%
InternVL3-5-38B 10.23% 5.56%
InternVL3.5-241B-A28B 10.76% 4.86%
Qwen3-VL-32B-Instruct 18.99% 10.42%
Qwen3.5-397B 52.50% 45.14%
GLM-4.6v 15.15% 7.64%
Prop. Qvq-max 9.58% 5.56%
Qwen3.5-plus 52.32% 44.44%
Claude-Sonnet-4.5 27.04% 18.75%
Claude-Opus-4.5 38.50% 30.56%
Claude-Opus-4.6 43.99% 36.81%
Grok-4-0709 36.23% 29.17%
Grok-4.1-fast-reasoning 32.73% 26.39%
GPT-5.2 51.17% 43.75%
Gemini-3-Flash 56.31% 48.61%
Gemini-3-Pro-Preview 57.22% 48.61%
Gemini-3.1-Pro-Preview 56.26% 48.61%

Open-weights models   Proprietary models  |  Bold underline = best in group, underline = second-best in group.

On the text-only split, Gemini-3.1-Pro-Preview leads with 64.92% question accuracy, followed by GPT-5.2 (51.15%) and Claude-Opus-4.6 (52.79%). Among open-weight models, Qwen3.5-397B tops at 48.52%, with Kimi-K2.5 (43.93%) close behind.

The multimodal split is harder across the board. The Gemini-3 family leads (48.61% Q. Acc.), while Qwen3.5 reaches 45.14%, narrowing the open-proprietary gap. However, most other open-weight vision-language models remain at or below 10%, revealing a sharp capability cliff outside the top tier.

Deconstructing the Frontier Model Performance Gap

To understand why even strong models remain far from reliable, we focus on Gemini-3-Flash and analyze only the instances it fails to solve end-to-end. Our diagnosis is organized around three questions: (Q1) Does the model fail due to missing reasoning/knowledge at the atomic level? (Q2) Is failure primarily driven by multi-step reasoning and error accumulation? (Q3) Does providing a single critical reasoning unit (e.g., a key fact or transformation) substantially increase the probability of reaching the correct final answer? We also assess reasoning efficiency by comparing the length of model-generated reasoning flows to the ground truth.

Formalizing the Reasoning Flow

We decompose each reference solution into an ordered sequence of verifiable reasoning units \( R = [u_1, u_2, \dots, u_n] \), where each unit \( u_i = \langle u_i^q, u_i^a \rangle \) pairs a sub-question (isolating a single step such as retrieving a fact, applying a formula, or performing a derivation) with its verifiable target answer. This representation enables step-wise evaluation, allowing us to probe whether models fail due to atomic deficits (inability to execute a single unit) or compositional deficits (inability to chain otherwise-solvable units), and to define controlled interventions that condition model outputs on partial reasoning states. We instantiate \( R \) via a two-stage pipeline: first, a judge model (GPT-mini) proposes a candidate decomposition into units; then, human annotators verify that each sub-question is unambiguous and each target answer is objectively checkable. The table below illustrates a representative example.

Question: Short circuited parallel plate electrodes of area \( A \) enclose a lossy dielectric of thickness \( s \) with dielectric permittivity \( \varepsilon \) and ohmic conductivity \( \sigma \). The lossy dielectric at time \( t=0 \) has a uniformly distributed free volume charge density \( \rho_0 \). Neglect fringing field effects. What is the current \( i(t) \) flowing through the short circuit?

Step Question \( u_i^q \) Answer \( u_i^a \)
\( u_1 \) Write the expression for the total current density in a lossy dielectric in terms of \( \sigma \), \( \varepsilon \), and \( E_x \) at the electrode (\( x = s \)). \( \displaystyle \frac{i(t)}{A} = \sigma E_x\big|_{x=s} + \varepsilon \frac{\partial E_x}{\partial t}\bigg|_{x=s} \)
\( u_2 \) For a uniform free volume charge density \( \rho_f(t) \) inside the slab of thickness \( s \), state the relation between \( E_x(x{=}s) \) and \( \rho_f(t) \) using Gauss's law. \( \displaystyle E_x(x{=}s) = \frac{\rho_f(t)\, s}{2\varepsilon} \)
\( u_3 \) Differentiate the expression from Step 2 with respect to time to obtain \( \partial E_x / \partial t \) at \( x = s \). \( \displaystyle \left.\frac{\partial E_x}{\partial t}\right|_{x=s} = \frac{s}{2\varepsilon}\,\frac{\partial \rho_f}{\partial t} \)
\( u_4 \) Substitute the expressions for \( E_x(x{=}s) \) and \( \partial E_x/\partial t \) from Steps 2 and 3 into the current density expression from Step 1. \( \displaystyle \frac{i(t)}{A} = \sigma\,\frac{\rho_f\, s}{2\varepsilon} + \varepsilon\,\frac{s}{2\varepsilon}\,\frac{\partial \rho_f}{\partial t} \)
\( u_5 \) Simplify the expression from Step 4 algebraically. \( \displaystyle \frac{i(t)}{A} = \frac{\sigma\, s}{2\varepsilon}\,\rho_f + \frac{s}{2}\,\frac{\partial \rho_f}{\partial t} \)
\( u_6 \) State the time-dependence of \( \rho_f(t) \) for initial value \( \rho_0 \) in a lossy dielectric with relaxation time \( \tau \). \( \displaystyle \rho_f(t) = \rho_0\, e^{-t/\tau} \)
\( u_7 \) Express the relaxation time \( \tau \) in terms of \( \varepsilon \) and \( \sigma \). \( \displaystyle \tau = \frac{\varepsilon}{\sigma} \)
\( u_8 \) Compute \( \partial \rho_f / \partial t \) from Step 6 using \( \tau \) from Step 7. \( \displaystyle \frac{\partial \rho_f}{\partial t} = -\frac{\sigma}{\varepsilon}\,\rho_0\, e^{-t/\tau} \)
\( u_9 \) Substitute \( \rho_f(t) \) and \( \partial \rho_f / \partial t \) from Steps 6 and 8 into the expression from Step 5. \( \displaystyle \frac{i(t)}{A} = \frac{\sigma\, s}{2\varepsilon}\,\rho_0\, e^{-t/\tau} + \frac{s}{2}\!\left(-\frac{\sigma}{\varepsilon}\,\rho_0\, e^{-t/\tau}\right) \)
\( u_{10} \) Algebraically combine the two terms from Step 9. \( \displaystyle \frac{i(t)}{A} = 0 \)
\( u_{11} \) State the resulting total current \( i(t) \) through the short circuit. \( \displaystyle i(t) = 0 \)

Table: An example reasoning flow. Each unit isolates a single reasoning step — retrieving a domain fact, applying a formula, or performing a local derivation — enabling fine-grained diagnosis of where model reasoning breaks down.

Q1: Unit Execution Ability

Setup. We isolate single-step competence via a unit execution test. For each unit index \( i \), we prompt the model with the original question \( Q \), the preceding units \( [u_1, \ldots, u_{i-1}] \), and the current sub-question \( u_i^q \), then verify whether the model produces the correct answer \( u_i^a \). We run 8 trials per unit and report mean accuracy across all questions whose reasoning flow includes unit \( i \) (excluding indices with fewer than five supporting instances).

Findings. Unit Execution diagnostics figures show that unit execution accuracy is consistently high, typically ~0.8-0.9 in the text subset, with a similar pattern and modest dips in the multimodal subset. This indicates that many end-to-end failures are not explained by an inability to execute individual steps once the correct sub-question is specified.

Figure: Unit Execution accuracy for text and multimodal subsets.

Q2: Reasoning Progression Capability

Setup. To examine long-horizon composition, we measure final-answer accuracy when progressively more of the reasoning flow is provided. For each unit index \( i \), we evaluate four prompting conditions: Reasoning Prefix (\( Q + [u_1, \ldots, u_i] \)), Reasoning Prefix (Questions Only) (\( Q + [u_1^q, \ldots, u_i^q] \)), Single-Unit Injection (\( Q + u_i \)), and Single-Unit Injection (Question Only) (\( Q + u_i^q \)). Each condition is sampled 8 times per question. We stratify by exact reasoning-flow length \( s \), restricting to buckets with at least five questions.

Findings. Across both modalities in the sample-level diagnostics, providing unit answers consistently outperforms providing only unit sub-questions. This gap indicates that the bottleneck is not merely identifying an appropriate decomposition, but reliably constructing correct intermediate states, i.e., producing the concrete intermediate values/expressions and preserving constraints as the derivation progresses. Notably, the performance gap is largest at mid-range unit indices. Early units tend to involve local setup and relatively direct manipulations, where the model can often proceed even without answer supervision; later units are increasingly dominated by the hardest long-flow instances and require a precise final consolidation step. In contrast, mid-chain units typically involve the most error-prone transformations. Providing the unit answers effectively "bridges" these difficult transitions, yielding the largest accuracy gains in the middle of the reasoning flow. Overall, this pattern suggests that current SOTA models are comparatively better at using correct intermediate results once provided, but remain brittle at deriving them and at faithfully maintaining state over long STEM derivations.

Figure: Sample-level diagnostics for the text subset. The red curve shows unit execution accuracy; other curves show final-answer accuracy under unit conditioning (Reasoning Prefix, Questions Only, Single-Unit Injection, and Question Only).

Figure: Sample-level diagnostics for the multimodal subset.

Q3: Critical Reasoning Units

Setup. We test whether end-to-end success depends on a critical intermediate unit by injecting only a single unit \( u_i \) or only its sub-question \( u_i^q \), then evaluating whether the model can correctly answer the final question. We use the same 8-sample protocol as in Q2.

Findings. Across both subsets, Single-Unit Injection yields meaningful gains over the Questions Only variants, while Single-Unit Injection (Question Only) remains low across all diagnostics. This confirms that the missing information is often the unit answer, not merely the decomposition structure. Notably, injecting a single unit with its answer can be nearly as effective as providing a full reasoning prefix without answers, despite conditioning on substantially less context. This indicates that once a correct intermediate value is supplied, the model can proceed with downstream deductions almost as effectively as if guided by many preceding sub-questions. The limiting factor is not knowing what intermediate questions to ask, but reliably deriving the right intermediate answers and carrying them forward without drift.

Reasoning Step Density and Efficiency

Beyond end-to-end accuracy, we analyze how efficiently strong models solve CFE questions. We compare the distribution of reasoning-flow lengths between human-verified ground truth and model-generated solutions. Across both subsets, model responses are on average longer than the ground truth, indicating lower step efficiency. For the text subset, the mean response length is 12.20 versus a ground-truth mean of 10.73 (~14% longer); for the multimodal subset, the gap is larger: 13.86 versus 11.72 (~18% longer). This length inflation indicates that current models do not reliably allocate reasoning steps to efficient intermediate quantities.

Figure: Reasoning-flow length distribution. Histograms show the frequency of questions at each length; dashed vertical lines indicate corresponding mean lengths.

Statistic Text Subset (n=305) Multimodal Subset (n=144)
Solved Unsolved Overall Solved Unsolved Overall
# Questions 179 126 305 74 70 144
GT len. (mean±std) 10.60 ± 4.06 10.91 ± 4.07 10.73 ± 4.07 11.04 ± 3.39 12.44 ± 5.56 11.72 ± 4.63
Resp. len. (mean±std) 11.91 ± 3.62 12.62 ± 4.34 12.20 ± 3.95 13.19 ± 4.13 14.57 ± 5.05 13.86 ± 4.65

Table: Reasoning-flow length statistics by outcome. GT = ground-truth, Resp. = model response.

Takeaways and Implications

Our diagnostic experiments yield three main takeaways about why frontier models underperform on CFE, together with concrete implications for improving future SOTA systems.

(T1) Atomic competence is not the primary bottleneck. Under the unit-execution test (Q1), the model attains consistently high accuracy, indicating that many end-to-end failures are not driven by missing isolated facts or inability to perform a single local derivation once the correct sub-question is specified.

(T2) Correct intermediate answers are critical. Across both text and multimodal subsets, conditions that provide unit answers consistently outperform their corresponding "questions-only" variants, with the largest gains appearing at mid-range unit indices. This suggests that the primary bottleneck is not simply identifying a plausible reasoning flow, but reliably deriving and maintaining correct intermediate states throughout the solution process. Moreover, injecting a single unit together with its answer can be nearly as effective as providing a much longer reasoning prefix without answers. Taken together, these results indicate that the truly critical signal is often the correct intermediate answer itself: once a key intermediate value or statement is available, it can unlock downstream reasoning and substantially improve end-to-end success.

(T3) Current reasoning is inefficient. Models generate longer reasoning flows than the human ground truth in both subsets. This length inflation indicates lower reasoning efficiency, with extra steps creating more opportunities for intermediate drift and error accumulation.

Implications. A promising direction is stronger supervision of intermediate states, e.g., step-verified targets, constraint checking, and curricula that reward correct intermediate values rather than only final answers or fluent explanations. These findings also motivate hybrid systems that (i) compute or retrieve key intermediate values using stronger tools (e.g., symbolic solvers, verified calculators, or structured retrieval) and (ii) condition the model on these validated intermediates. More broadly, improving CFE performance will require more efficient reasoning; training objectives that penalize redundant steps and reward compact derivations are likely to improve both accuracy and efficiency.

BibTeX

@@misc{gao2026classroomfinalexaminstructortested,
      title={Classroom Final Exam: An Instructor-Tested Reasoning Benchmark}, 
      author={Chongyang Gao and Diji Yang and Shuyan Zhou and Xichen Yan and Luchuan Song and Shuo Li and Kezhen Chen},
      year={2026},
      eprint={2602.19517},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.19517}, 
}