OpenAI o3 sets new records in several key areas, particularly reasoning, coding, and mathematical problem-solving. It scores 75.7% on the ARC-AGI semi-private evaluation in low-compute mode (roughly $20 per task in compute) and 87.5% in high-compute mode (thousands of dollars per task). The cost is steep, but the results are not mere brute force: these capabilities are new territory, and they demand serious scientific attention.
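To put those compute costs in perspective, here is a minimal back-of-the-envelope sketch of the evaluation economics. The task count and the high-compute per-task figure are illustrative assumptions, not official numbers:

```python
# Back-of-the-envelope cost estimate for a full ARC-AGI evaluation run.
# Task count and per-task costs are illustrative assumptions, not
# official figures published by OpenAI or ARC Prize.

NUM_TASKS = 100                  # assumed size of the semi-private eval set
LOW_COMPUTE_PER_TASK = 20.0      # ~$20/task, as quoted for low-compute mode
HIGH_COMPUTE_PER_TASK = 3_000.0  # "thousands of dollars" per task (assumed)

def total_cost(per_task: float, tasks: int = NUM_TASKS) -> float:
    """Total evaluation cost in dollars for one full run."""
    return per_task * tasks

print(f"Low-compute run:  ~${total_cost(LOW_COMPUTE_PER_TASK):,.0f}")
print(f"High-compute run: ~${total_cost(HIGH_COMPUTE_PER_TASK):,.0f}")
```

Under these assumptions, a single low-compute run costs on the order of $2,000, while a high-compute run reaches hundreds of thousands of dollars, which is why the high-compute score is best read as a capability demonstration rather than a practical operating point.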
Benchmark Performance
ARC-AGI Benchmark
o3 has achieved a breakthrough score on the ARC-AGI benchmark, which is considered an indicator of progress toward artificial general intelligence:
o3 scored 75.7% in its standard low-compute configuration
With increased resources (high-compute mode), o3 reached an unprecedented 87.5%
The high-compute result surpasses the 85% human-level threshold and represents a significant leap from its predecessor, o1, which scored only about 32% (a scoring sketch follows this list)
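For readers unfamiliar with the benchmark, ARC-AGI tasks are small grid-transformation puzzles distributed as JSON, each with a few train input/output examples and one or more test inputs; the benchmark score is simply the fraction of tasks whose test outputs the model reproduces exactly. The sketch below illustrates that scoring rule under those assumptions (the file name and the `predict` callable are hypothetical placeholders):

```python
import json

# Minimal sketch of ARC-AGI scoring: a task is a JSON object with
# "train" and "test" lists of {"input": grid, "output": grid} pairs,
# where a grid is a list of lists of ints 0-9. A task counts as solved
# only if every predicted test grid matches the target exactly.

def score_task(task: dict, predict) -> bool:
    """Return True if `predict` solves every test pair exactly."""
    return all(
        predict(task["train"], pair["input"]) == pair["output"]
        for pair in task["test"]
    )

def score_eval_set(tasks: list[dict], predict) -> float:
    """Benchmark score = fraction of tasks solved (e.g. 0.757 -> 75.7%)."""
    solved = sum(score_task(t, predict) for t in tasks)
    return solved / len(tasks)

if __name__ == "__main__":
    # Trivial baseline that just echoes the test input back.
    identity = lambda train_pairs, test_input: test_input
    # "arc_tasks.jsonl" is a hypothetical file of one JSON task per line.
    tasks = [json.loads(line) for line in open("arc_tasks.jsonl")]
    print(f"Score: {score_eval_set(tasks, identity):.1%}")
```

The exact-match requirement is what makes the benchmark hard: partial credit is never awarded, so 75.7% means roughly three out of four puzzles solved perfectly.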
Mathematics and Problem-Solving
o3 demonstrates strong mathematical reasoning and problem-solving:
Near-perfect score (96.7%) on the 2024 American Invitational Mathematics Examination (AIME)
25.2% on EpochAI's FrontierMath benchmark, far exceeding previous models, none of which had broken 2%
Coding and Software Engineering
In coding-related tasks, o3 shows substantial improvements:
SWE-bench Verified: 71.7%, which is 22.8 points higher than o1's 48.9%
Codeforces: achieved an Elo rating of 2,727 (the sketch below shows what that rating implies head-to-head)
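An Elo rating is only meaningful relative to other ratings. Under the standard Elo model, a player rated R_A has an expected score of 1 / (1 + 10^((R_B − R_A)/400)) against an opponent rated R_B. The sketch below applies that formula to o3's 2,727 rating; the opponent ratings are illustrative reference points, not measured matchups:

```python
# Standard Elo expected-score formula: the expected score (roughly, the
# win probability) of a player rated r_a against an opponent rated r_b.

def elo_expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

O3_RATING = 2727  # o3's reported Codeforces Elo

# Illustrative opponent ratings (Codeforces tier labels are approximate):
for label, rating in [("typical competitor (~1500)", 1500),
                      ("strong expert (~1900)", 1900),
                      ("grandmaster (~2400)", 2400)]:
    p = elo_expected_score(O3_RATING, rating)
    print(f"Expected score vs {label}: {p:.3f}")
```

By this formula, a 2,727-rated player would be expected to score above 0.99 against a typical competitor and around 0.87 even against a grandmaster-level opponent, which conveys how far into the competitive tail this rating sits.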
Comparison with Gemini 2 and Other Models
While o3 demonstrates exceptional performance, Gemini 2 and other models also show strong capabilities:
Gemini 2.0 Flash
Outperforms its predecessor Gemini 1.5 Pro on key benchmarks
Excels in competition-level math problems, achieving state-of-the-art results on MATH and HiddenMath
Performs well in language and multimedia understanding, outperforming GPT-4o on MMLU-Pro
Model Rankings
In various benchmarks and comparisons:
Chatbot Arena: Gemini 2.0 Experimental Advanced ranks slightly above the latest version of OpenAI's ChatGPT-4o
MMLU-Pro: Gemini 2.0 Flash outperforms GPT-4o but is behind Claude 3.5 Sonnet
Coding ability: Claude 3.5 Sonnet, GPT-4o, o1-preview, and o1-mini outperform Gemini 2.0 Flash
Performance Metrics
Reasoning and Problem-Solving
OpenAI o3: This model is noted for exceptional performance on complex reasoning problems. It scores dramatically higher on benchmarks like ARC-AGI, where it has been described as roughly three times better than its predecessor, o1. It also excels in mathematics and coding, with 96.7% on the AIME 2024 exam and a 2,727 Elo rating on Codeforces, a score reported to surpass that of OpenAI's own Chief Scientist.
Google Gemini 2.0 Flash: Detailed reasoning results are less prominently reported, but Gemini 2.0 Flash is positioned as fast and efficient, matching or exceeding its predecessor Gemini 1.5 Pro on some benchmarks while prioritizing speed over comprehensive reasoning improvements.
Context Window and Speed
OpenAI o3: Has a context window of 128K tokens, standard among high-end models but smaller than that of some Gemini models.
Gemini 2.0 Flash: Offers a context window of 1 million tokens, significantly larger, allowing it to process much bigger documents or datasets in a single pass, as the sketch below illustrates.
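As a rough illustration of what that difference means in practice, the sketch below estimates whether a given document fits in each window, using the common heuristic of about 4 characters per token. Both the heuristic and the input file name are assumptions; real token counts depend on the tokenizer:

```python
# Rough estimate of whether a document fits in a model's context window.
# Uses the common ~4 characters/token heuristic; actual token counts
# depend on the tokenizer, so treat this as a ballpark only.

CHARS_PER_TOKEN = 4  # assumed heuristic

WINDOWS = {
    "o3 (reported)": 128_000,
    "Gemini 2.0 Flash (reported)": 1_000_000,
}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)

def check_fit(text: str) -> None:
    tokens = estimate_tokens(text)
    print(f"~{tokens:,} tokens estimated")
    for model, window in WINDOWS.items():
        verdict = "fits" if tokens <= window else "exceeds window"
        print(f"  {model}: {verdict} ({window:,}-token window)")

if __name__ == "__main__":
    # Hypothetical input file; substitute any large document.
    with open("big_document.txt", encoding="utf-8") as f:
        check_fit(f.read())
```

Under the 4-characters-per-token heuristic, a 128K window holds roughly 500KB of text while a 1M window holds around 4MB, which is the difference between a long report and an entire book series in a single prompt.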