We introduce CFE-Bench (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE-Bench is curated from authentic, repeatedly used university homework and exam problems, together with reference solutions provided by course instructors. The benchmark presents a significant challenge even for frontier models: the newly released Gemini-3.1-Pro-Preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-Flash-Preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We also observe that model-generated solutions typically contain more reasoning steps than instructor solutions, indicating suboptimal step efficiency and a higher risk of error accumulation.
Read the full paper and explore the leaderboard →
What's in the Dataset
CFE-Bench contains 449 problems split across two modalities:
- Text-only (305 questions): Primarily physics and mathematics, with coverage in economics, electrical engineering, and computer science.
- Multimodal (144 questions): Physics, mechanical engineering, electrical engineering, and other disciplines requiring diagram or figure interpretation.
All problems require multi-step derivation. A correct answer means getting every intermediate variable right — not just the final number.
How We Evaluate: Short-to-Short (S2S)
STEM evaluation is notoriously brittle. Two algebraically equivalent expressions look different to a string matcher. Long model outputs trigger false positives when they happen to include the right substring.
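Both failure modes are easy to reproduce. Here is a minimal sketch (the grader functions and example strings are ours, purely illustrative, not the paper's evaluation code):

```python
def string_match(prediction: str, reference: str) -> bool:
    """Naive grader: exact match after whitespace stripping."""
    return prediction.strip() == reference.strip()


def substring_match(prediction: str, reference: str) -> bool:
    """Lenient grader: counts any occurrence of the reference string."""
    return reference.strip() in prediction


# Failure mode 1: algebraically equivalent answers look different.
print(string_match("0.5", "1/2"))  # False negative: same value, no match

# Failure mode 2: a long, wrong response happens to contain the right substring.
wrong_solution = "... so v = 12 m/s, therefore the answer is 120 J ..."
print(substring_match(wrong_solution, "12"))  # False positive
```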
We address this with variable-based evaluation (S2S). Instead of comparing full responses, we ask the model to report specific named variables, then check each predicted value against instructor-annotated ground truth. S2S achieves 98% accuracy on our validation set and — crucially — the lowest false positive rate of any evaluation method we tested.
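The core idea can be sketched in a few lines. This is our own simplification, assuming ground truth is stored as a mapping from instructor-annotated variable names to numeric values and that a relative tolerance is acceptable; the actual data format and matching rules in CFE-Bench may differ:

```python
import math

def grade_problem(predicted: dict, truth: dict, rel_tol: float = 1e-2):
    """Check each named variable; the question counts as correct
    only if every variable matches (assumed tolerance, for illustration)."""
    per_variable = {
        name: (name in predicted
               and math.isclose(predicted[name], value, rel_tol=rel_tol))
        for name, value in truth.items()
    }
    return per_variable, all(per_variable.values())


truth = {"v": 12.0, "E_k": 144.0}
pred = {"v": 12.0, "E_k": 150.0}   # sub-step right, final value wrong
per_var, question_correct = grade_problem(pred, truth)
print(per_var)           # {'v': True, 'E_k': False}
print(question_correct)  # False
```

Comparing named variables rather than raw response strings is what lets the metric distinguish "solved a sub-step" from "solved the whole problem."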
What Frontier Models Score Today
The benchmark is genuinely hard. On the combined text + multimodal split, the best model we evaluated — Gemini-3.1-Pro-Preview — achieves 59.7% question accuracy, meaning it gets every variable in a problem correct less than 60% of the time. The gap between variable accuracy and question accuracy is telling: models frequently solve individual sub-steps while failing to carry the correct state through to the end.
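A toy calculation (with made-up numbers, not benchmark data) shows why the two metrics diverge: a single wrong variable fails the whole question, so question accuracy is always the stricter number:

```python
# Each inner list marks which variables a model got right in one problem.
results = [
    [True, True, True],    # fully correct problem
    [True, True, False],   # last step wrong -> whole question counted wrong
    [True, False, True],
]

n_vars = sum(len(r) for r in results)
variable_accuracy = sum(v for r in results for v in r) / n_vars
question_accuracy = sum(all(r) for r in results) / len(results)

print(f"variable accuracy: {variable_accuracy:.2f}")  # 0.78
print(f"question accuracy: {question_accuracy:.2f}")  # 0.33
```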
We also observe that model-generated solutions contain more reasoning steps on average than instructor solutions. Each extra step is another opportunity to introduce an error, and the data bears this out.
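A back-of-the-envelope calculation makes the intuition concrete. Assuming (purely hypothetically, this is not a measured quantity) that each step is independently correct with probability p, longer solutions compound the risk:

```python
p = 0.95  # hypothetical per-step success probability, for illustration only

for steps in (5, 8, 12):
    # P(all steps correct) = p ** steps under the independence assumption
    print(f"{steps:2d} steps -> P(all correct) = {p ** steps:.2f}")
# 5 steps -> 0.77, 8 steps -> 0.66, 12 steps -> 0.54
```

Under this simple model, a solution that takes 12 steps instead of 5 at the same per-step reliability is noticeably more likely to fail somewhere along the way.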
Connection to Training Data
At AnalogyAI, we build infrastructure for curating high-quality training datasets. CFE-Bench is a direct product of that work: the same sourcing and curation pipeline we use for clients was applied to authentic academic material to construct a benchmark where data leakage is minimized by design.
We see CFE-Bench as both a research contribution and a demonstration of what intent-driven data curation can produce. The full leaderboard, dataset, and paper are available at the link below.
