OpenAI: o4 Mini

Survived 11 out of 15 breakers

Resilience

73%
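The two headline figures agree if resilience is read as the share of breakers survived, rounded to a whole percent. The page does not define the score, so that reading is an assumption, and the function name `resilience_percent` is hypothetical. A minimal sketch of the arithmetic:

```python
def resilience_percent(survived: int, total: int) -> int:
    """Share of breakers survived, as a whole-number percentage.

    Assumption: resilience = survived / total, rounded to the nearest percent.
    The source page does not state how the score is computed.
    """
    if total <= 0 or not 0 <= survived <= total:
        raise ValueError("expected 0 <= survived <= total and total > 0")
    return round(100 * survived / total)


# 11 of 15 breakers survived, as reported on this page.
print(resilience_percent(11, 15))
```

Under that assumption, 11 / 15 is about 0.733, which matches the 73% shown.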

OpenAI o4-mini is a compact reasoning model in the o-series, optimized for fast, cost-efficient performance while retaining strong multimodal and agentic capabilities. It supports tool use and posts competitive reasoning and coding results on benchmarks such as AIME (99.5% with Python) and SWE-bench, outperforming its predecessor o3-mini and approaching o3 in some domains. Despite its smaller size, o4-mini is highly accurate on STEM tasks, visual problem solving (e.g., MathVista, MMMU), and code editing. It is especially well suited to high-throughput scenarios where latency or cost is critical. Thanks to its efficient architecture and refined reinforcement learning training, o4-mini can chain tools, generate structured outputs, and solve multi-step tasks with minimal delay, often in under a minute.

Context	200,000 tokens
Cost (Input)	$1.10 / 1M tokens
Cost (Output)	$4.40 / 1M tokens
Max completion tokens	100,000
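The per-token prices and limits above can be turned into a rough cost estimate for a single request. The sketch below uses only the figures listed on this page; the function name `estimate_cost_usd` is hypothetical, and the check that input and output tokens share the 200,000-token context window is an assumption rather than something the page states.

```python
INPUT_USD_PER_MTOK = 1.10        # Cost (Input), USD per 1M tokens
OUTPUT_USD_PER_MTOK = 4.40       # Cost (Output), USD per 1M tokens
CONTEXT_TOKENS = 200_000         # Context
MAX_COMPLETION_TOKENS = 100_000  # Max completion tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > MAX_COMPLETION_TOKENS:
        raise ValueError("output exceeds the max completion tokens")
    # Assumption: input and output tokens share one context window.
    if input_tokens + output_tokens > CONTEXT_TOKENS:
        raise ValueError("request exceeds the context window")
    return (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# Example: a 10,000-token prompt with a 2,000-token completion.
print(f"${estimate_cost_usd(10_000, 2_000):.4f}")
```

Because output tokens cost four times as much as input tokens here, long completions dominate the bill even when prompts are large.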

Toughest Breakers

Test	Category	Pass rate
Contradictory Premises	Logic Reasoning	11%
Self-Reference Count	Self Reference	22%
10-Step Instructions	Instruction Following	22%

Breaker Results

Test	Category	Success Rate
Contradictory Premises	Logic Reasoning	11%
Self-Reference Count	Self Reference	22%
10-Step Instructions	Instruction Following	22%
The Missing A	Pattern Matching	25%
Horse Race Logic	Logic Reasoning	25%
Car Wash Dilemma	Logic Reasoning	75%
Coin Flip Paradox	Logic Reasoning	75%
Silence Protocol	Instruction Following	78%
Strawberry Problem	Character Counting	100%
Reverse Word Test	Character Manipulation	100%
Alice's Brother Problem	Logic Reasoning	100%
Broken Mug	Lateral Thinking	100%
Bullshit Detector	Epistemic Humility	100%
The Compartment Trick	Logic Reasoning	100%
Sycophancy Trap	Logic Reasoning	100%