Imagine a classroom full of bright young students, all striving to do their best on a high school math test.
They sit, heads slightly bowed, pencils ready, each one thinking about how to tackle each problem. They respond to their surroundings in recognizably human ways: crossing or uncrossing their legs, chewing on their erasers, glancing around the room as they consider their answers.
Now, imagine that all of those human students were LLMs. The teacher calls each one by name, gives it a task, and it gets started, answering the same high school math problems. There’s no human behavior – it all happens in the void where LLMs live – but the teacher can still make the same evaluations and build the same rosters showing how each student performs on quantified math metrics: did Johnny do better than Susie?
People outside the industry usually don’t realize that this is where the community now stands in its work on AI agents.
The MATH Data Set
The MATH data set, introduced by Hendrycks et al. in 2021, contains 12,500 problems drawn from high school math competitions, covering areas such as algebra, geometry, probability, and number theory.
There’s even some precalculus.
Problems are graded in difficulty from Level 1 to Level 5, with the hardest requiring complex, multi-step reasoning or even creative insight.
One of the main applications of the MATH data set is evaluating how well LLMs perform on the kind of math students face in educational settings.
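To make that concrete, here is a minimal sketch of loading and inspecting the data set, assuming the Hugging Face datasets library and the commonly used hendrycks/competition_math copy of MATH; the field names used below (problem, type, level, solution) reflect how that copy is typically organized and may differ in other mirrors.

```python
# Minimal sketch: load the MATH data set and look at its structure.
# Assumes the Hugging Face `datasets` library and the `hendrycks/competition_math`
# dataset id; field names may differ in other mirrors of the data set.
from collections import Counter

from datasets import load_dataset

math_test = load_dataset("hendrycks/competition_math", split="test")

# Count problems per subject area (algebra, geometry, number theory, ...).
print(Counter(example["type"] for example in math_test))

# Peek at a single problem: statement, difficulty level, and worked solution.
example = math_test[0]
print(example["problem"])   # the problem statement, in LaTeX-flavored text
print(example["level"])     # e.g. "Level 3" on the 1-to-5 difficulty scale
print(example["solution"])  # a step-by-step solution ending in a boxed answer
```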
Meet Some of the Students
To understand where LLMs stand, it helps to look at some of the most compelling examples of AI students and how they did on the MATH test. Minerva, from Google, has achieved an accuracy rate of roughly 50%.
A model called Llama 2 7B shows how much the sampling strategy matters: when the best response was selected from 256 generations on the MATH data set, it achieved 72.0% accuracy, but scoring only the first generated answer yielded just 7.9%.
A Skywork Math 7B model scored 51.2%, better than some early versions of GPT-4 managed on the same benchmark.
Then there is a Qwen model from Alibaba that got 83.6%.
The list goes on, but here you see some of the best fruits of this kind of research – with the individual models called by name.
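To get a feel for what “selecting the best response from 256 generations” means compared with scoring only the first answer, here is a minimal sketch of the two scoring schemes. The sampler below is a toy stand-in rather than any particular model’s API, and exact-match checking is a simplification of how MATH answers are actually graded.

```python
# Minimal sketch: first-sample accuracy versus best-of-n accuracy.
# The "model" here is a toy sampler, not a real LLM.
import random
from typing import Callable, List, Tuple


def evaluate(problems: List[dict],
             sample_answers: Callable[[str, int], List[str]],
             n_samples: int = 256) -> Tuple[float, float]:
    """Return (first-sample accuracy, best-of-n accuracy) over `problems`."""
    first_hits = best_hits = 0
    for item in problems:
        samples = sample_answers(item["problem"], n_samples)
        # Score only the first generation (what a single query would return).
        first_hits += samples[0] == item["answer"]
        # Credit the problem if *any* of the n generations matches the answer.
        best_hits += any(s == item["answer"] for s in samples)
    return first_hits / len(problems), best_hits / len(problems)


# Toy demo: a "model" that answers correctly about 10% of the time per sample.
problems = [{"problem": f"q{i}", "answer": "42"} for i in range(100)]
sampler = lambda q, n: ["42" if random.random() < 0.1 else "wrong" for _ in range(n)]
print(evaluate(problems, sampler))  # first-sample accuracy ~0.10, best-of-256 ~1.0
```

The gap between the two numbers is exactly the kind of spread the Llama 2 7B results illustrate: a model whose single best guess is often wrong may still produce a correct answer somewhere in a large batch of samples.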
Applications of MATH Data Set Research
So what’s the main goal here?
Here’s how ChatGPT-4o describes the evaluation of its colleagues:
“Scientists train Large Language Models (LLMs) on datasets like the MATH dataset to assess and improve their mathematical reasoning abilities, a critical component of general intelligence. The MATH dataset includes diverse and challenging problems across areas like algebra, geometry, calculus, and combinatorics, providing a rigorous benchmark for evaluating how well LLMs can perform precise, logical reasoning tasks. Success on such datasets demonstrates a model’s capacity for structured problem-solving, critical thinking, and the application of learned concepts, which are valuable for a wide range of real-world applications, from scientific research to coding and automated tutoring.”
In math classrooms across America, human students are learning. In the labs, AI students are learning, too. How will these two trends interact?
Well, for starters, there’s AI-powered tutoring, which could give every American student access to the Socratic method and the back-and-forth that helps develop a wide range of cognitive skills. (Shout out to Dr. John Sviokla, who was talking about this a few weeks ago.)
Beyond tutoring, LLMs could grade students’ work as well.
It’s all part of a very new frontier in the LLM’s ability to interact with humans in new ways. Keep an eye out for this kind of academically oriented research to continue.