New secret math benchmark stumps AI models and PhDs alike

On Friday, research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world. It contains hundreds of expert-level problems that leading AI models solve less than 2 percent of the time, according to Epoch AI. The benchmark tests AI language models (such as GPT-4o, which powers ChatGPT) against original mathematics problems that typically take specialist mathematicians hours or days to complete.

FrontierMath’s performance results, revealed in a preprint research paper, paint a stark picture of current AI model limitations. Even with access to Python environments for testing and verification, top models like Claude 3.5 Sonnet, GPT-4o, o1-preview, and Gemini 1.5 Pro scored extremely poorly. This contrasts with their high performance on simpler math benchmarks: many models now score above 90 percent on tests like GSM8K and MATH.

The design of FrontierMath differs from many existing AI benchmarks because the problem set remains private and unpublished to prevent data contamination. Many existing AI models are trained on other test problem datasets, allowing the AI models to easily solve the problems and appear more generally capable than they actually are. Many experts cite this as evidence that current large language models (LLMs) are poor generalist learners.
