AI models get better at math but still get low marks
Briefly

"A calculator is predictable. Ask it the same question today or next year, and the answer stays the same. AI doesn't work that way. These systems are predicting the next likely word based on patterns. Mathematically, it's possible for a model to get a question right today and wrong tomorrow."
"Current-day LLMs are prediction engines and, as such, they can only find the most likely solution to problems, which is not necessarily the correct one. Though popular models have mostly become better at math, even top performer Gemini 3 Flash would receive a C if assessed with a letter grade."
Researchers at Omni Calculator evaluated multiple large language models with the ORCA Benchmark, a test of 500 practical math questions. In the latest round, Gemini 3 Flash achieved 72.8% accuracy, DeepSeek V3.2 reached 55.2%, and ChatGPT 5.2 scored 54.0%, all improvements over previous versions, while Grok 4.1 regressed to 60.2%. The results underline that LLMs are prediction engines: they find the statistically likely solution rather than the mathematically correct one. Unlike a calculator, which returns the same answer every time, an AI model may answer the same question differently from one run to the next, introducing inherent variability into mathematical problem-solving.
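The calculator-versus-LLM distinction can be sketched in a few lines. This is a minimal illustration, not how any real model is implemented: the candidate answers and their probabilities below are invented for the example, standing in for a model's distribution over next tokens.

```python
import random

# A calculator is deterministic: the same input always yields the same output.
def calculator(a: int, b: int) -> int:
    return a + b

# An LLM, by contrast, samples from a probability distribution over candidate
# outputs. These weights are hypothetical, purely for illustration.
def sampled_answer(candidates, rng):
    answers, weights = zip(*candidates)
    return rng.choices(answers, weights=weights, k=1)[0]

# Invented distribution a model might assign to answers for "2 + 2":
candidates = [("4", 0.90), ("5", 0.07), ("3", 0.03)]

print(calculator(2, 2))                            # always 4
print(sampled_answer(candidates, random.Random())) # usually "4", but not always
```

The sketch shows why the same question can be answered correctly today and incorrectly tomorrow: the most likely answer is favored, not guaranteed.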
Read at The Register