OpenAI o3 sets new records in several key areas, particularly reasoning, coding, and mathematical problem-solving. It scores 75.7% on the ARC-AGI semi-private evaluation in low-compute mode (roughly $20 per task in compute) and 87.5% in high-compute mode (thousands of dollars per task). The cost is steep, but the results are not mere brute force: these capabilities are new territory, and they demand serious scientific attention.
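To put those compute costs in perspective, here is a minimal back-of-the-envelope sketch of the evaluation economics. The task count and the high-compute per-task figure are illustrative assumptions, not official numbers:

```python
# Back-of-the-envelope cost estimate for a full ARC-AGI evaluation run.
# Task count and per-task costs are illustrative assumptions, not
# official figures published by OpenAI or ARC Prize.

NUM_TASKS = 100                  # assumed size of the semi-private eval set
LOW_COMPUTE_PER_TASK = 20.0      # ~$20/task, as quoted for low-compute mode
HIGH_COMPUTE_PER_TASK = 3_000.0  # "thousands of dollars" per task (assumed)

def total_cost(per_task: float, tasks: int = NUM_TASKS) -> float:
    """Total evaluation cost in dollars for one full run."""
    return per_task * tasks

print(f"Low-compute run:  ~${total_cost(LOW_COMPUTE_PER_TASK):,.0f}")
print(f"High-compute run: ~${total_cost(HIGH_COMPUTE_PER_TASK):,.0f}")
```

Under these assumptions, a single low-compute run costs on the order of $2,000, while a high-compute run reaches hundreds of thousands of dollars, which is why the high-compute score is best read as a capability demonstration rather than a practical operating point.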
Benchmark Performance
ARC-AGI Benchmark
o3 has achieved a breakthrough score on the ARC-AGI benchmark, which is considered an indicator of progress toward artificial general intelligence:
o3 scored 75.7% in its standard low-compute configuration
With increased resources (high-compute mode), o3 reached an unprecedented 87.5%
The high-compute result surpasses the 85% human-level threshold and represents a significant leap from its predecessor, o1, which scored only about 32% (a scoring sketch follows this list)
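For readers unfamiliar with the benchmark, ARC-AGI tasks are small grid-transformation puzzles distributed as JSON, each with a few train input/output examples and one or more test inputs; the benchmark score is simply the fraction of tasks whose test outputs the model reproduces exactly. The sketch below illustrates that scoring rule under those assumptions (the file name and the `predict` callable are hypothetical placeholders):

```python
import json

# Minimal sketch of ARC-AGI scoring: a task is a JSON object with
# "train" and "test" lists of {"input": grid, "output": grid} pairs,
# where a grid is a list of lists of ints 0-9. A task counts as solved
# only if every predicted test grid matches the target exactly.

def score_task(task: dict, predict) -> bool:
    """Return True if `predict` solves every test pair exactly."""
    return all(
        predict(task["train"], pair["input"]) == pair["output"]
        for pair in task["test"]
    )

def score_eval_set(tasks: list[dict], predict) -> float:
    """Benchmark score = fraction of tasks solved (e.g. 0.757 -> 75.7%)."""
    solved = sum(score_task(t, predict) for t in tasks)
    return solved / len(tasks)

if __name__ == "__main__":
    # Trivial baseline that just echoes the test input back.
    identity = lambda train_pairs, test_input: test_input
    # "arc_tasks.jsonl" is a hypothetical file of one JSON task per line.
    tasks = [json.loads(line) for line in open("arc_tasks.jsonl")]
    print(f"Score: {score_eval_set(tasks, identity):.1%}")
```

The exact-match requirement is what makes the benchmark hard: partial credit is never awarded, so 75.7% means roughly three out of four puzzles solved perfectly.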
Mathematics and Problem-Solving
o3 demonstrates strong mathematical reasoning and problem-solving:
Near-perfect score (96.7%) on the 2024 American Invitational Mathematics Examination (AIME)
25.2% on EpochAI's FrontierMath benchmark, far exceeding previous models, none of which had broken 2%
Coding and Software Engineering
In coding-related tasks, o3 shows substantial improvements:
SWE-bench Verified: 71.7%, which is 22.8 points higher than o1's 48.9%
Codeforces: achieved an Elo rating of 2,727 (the sketch below shows what that rating implies head-to-head)
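An Elo rating is only meaningful relative to other ratings. Under the standard Elo model, a player rated R_A has an expected score of 1 / (1 + 10^((R_B − R_A)/400)) against an opponent rated R_B. The sketch below applies that formula to o3's 2,727 rating; the opponent ratings are illustrative reference points, not measured matchups:

```python
# Standard Elo expected-score formula: the expected score (roughly, the
# win probability) of a player rated r_a against an opponent rated r_b.

def elo_expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

O3_RATING = 2727  # o3's reported Codeforces Elo

# Illustrative opponent ratings (Codeforces tier labels are approximate):
for label, rating in [("typical competitor (~1500)", 1500),
                      ("strong expert (~1900)", 1900),
                      ("grandmaster (~2400)", 2400)]:
    p = elo_expected_score(O3_RATING, rating)
    print(f"Expected score vs {label}: {p:.3f}")
```

By this formula, a 2,727-rated player would be expected to score above 0.99 against a typical competitor and around 0.87 even against a grandmaster-level opponent, which conveys how far into the competitive tail this rating sits.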
Comparison with Gemini 2 and Other Models
While o3 demonstrates exceptional performance, Gemini 2 and other models also show strong capabilities:
Gemini 2.0 Flash
Outperforms its predecessor Gemini 1.5 Pro on key benchmarks
Excels in competition-level math problems, achieving state-of-the-art results on MATH and HiddenMath
Performs well in language and multimedia understanding, outperforming GPT-4o on MMLU-Pro
Model Rankings
In various benchmarks and comparisons:
Chatbot Arena: Gemini 2.0 Experimental Advanced ranks slightly above the latest version of OpenAI's ChatGPT-4o
MMLU-Pro: Gemini 2.0 Flash outperforms GPT-4o but is behind Claude 3.5 Sonnet
Coding ability: Claude 3.5 Sonnet, GPT-4o, o1-preview, and o1-mini outperform Gemini 2.0 Flash
Performance Metrics
Reasoning and Problem-Solving
OpenAI o3: This model is noted for exceptional performance on complex reasoning problems. It scores dramatically higher on benchmarks like ARC-AGI, where it has been described as roughly three times better than its predecessor, o1. It also excels in mathematics and coding, with 96.7% on the AIME 2024 exam and a 2,727 Elo rating on Codeforces, a score reported to surpass that of OpenAI's own Chief Scientist.
Google Gemini 2.0 Flash: Detailed reasoning results are less prominently reported, but Gemini 2.0 Flash is positioned as fast and efficient, matching or exceeding its predecessor Gemini 1.5 Pro on some benchmarks while prioritizing speed over comprehensive reasoning improvements.
Context Window and Speed
OpenAI o3: Has a context window of 128K tokens, standard among high-end models but smaller than that of some Gemini models.
Gemini 2.0 Flash: Offers a context window of 1 million tokens, significantly larger, allowing it to process much bigger documents or datasets in a single pass, as the sketch below illustrates.
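As a rough illustration of what that difference means in practice, the sketch below estimates whether a given document fits in each window, using the common heuristic of about 4 characters per token. Both the heuristic and the input file name are assumptions; real token counts depend on the tokenizer:

```python
# Rough estimate of whether a document fits in a model's context window.
# Uses the common ~4 characters/token heuristic; actual token counts
# depend on the tokenizer, so treat this as a ballpark only.

CHARS_PER_TOKEN = 4  # assumed heuristic

WINDOWS = {
    "o3 (reported)": 128_000,
    "Gemini 2.0 Flash (reported)": 1_000_000,
}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)

def check_fit(text: str) -> None:
    tokens = estimate_tokens(text)
    print(f"~{tokens:,} tokens estimated")
    for model, window in WINDOWS.items():
        verdict = "fits" if tokens <= window else "exceeds window"
        print(f"  {model}: {verdict} ({window:,}-token window)")

if __name__ == "__main__":
    # Hypothetical input file; substitute any large document.
    with open("big_document.txt", encoding="utf-8") as f:
        check_fit(f.read())
```

Under the 4-characters-per-token heuristic, a 128K window holds roughly 500KB of text while a 1M window holds around 4MB, which is the difference between a long report and an entire book series in a single prompt.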