Survived 11 out of 15 breakers
OpenAI o4-mini is a compact reasoning model in the o-series, optimized for fast, cost-efficient performance while retaining strong multimodal and agentic capabilities. It supports tool use and posts competitive reasoning and coding results on benchmarks such as AIME (99.5% with Python) and SWE-bench, outperforming its predecessor o3-mini and approaching o3 in some domains. Despite its smaller size, o4-mini is highly accurate on STEM tasks, visual problem solving (e.g., MathVista, MMMU), and code editing. It is especially well suited to high-throughput scenarios where latency or cost is critical. Thanks to its efficient architecture and refined reinforcement learning training, o4-mini can chain tools, generate structured outputs, and solve multi-step tasks with minimal delay, often in under a minute.
Context window: 200,000 tokens
Input price: $1.10 / 1M tokens
Output price: $4.40 / 1M tokens
Max output: 100,000 tokens
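
To make the per-token prices concrete, here is a minimal sketch that estimates the cost of a single request. It assumes $1.10 per million is the input-token rate and $4.40 per million is the output-token rate; the helper name `estimate_cost_usd` and the constant names are hypothetical and chosen only for illustration.

```python
from decimal import Decimal

# Rates taken from the figures listed above. Treating the first as the input
# rate and the second as the output rate is an assumption of this sketch.
INPUT_USD_PER_MILLION = Decimal("1.10")
OUTPUT_USD_PER_MILLION = Decimal("4.40")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        Decimal(input_tokens) * INPUT_USD_PER_MILLION
        + Decimal(output_tokens) * OUTPUT_USD_PER_MILLION
    )
    return total / TOKENS_PER_MILLION


if __name__ == "__main__":
    # Example: a 10,000-token prompt that produces a 2,000-token response.
    print(estimate_cost_usd(10_000, 2_000))
```

Under these assumptions, a request with 10,000 input tokens and 2,000 output tokens costs roughly two cents. `Decimal` is used instead of floats so that small per-request amounts add up without rounding drift when summed over many calls.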
| Test | Category | Success Rate |
|---|---|---|
| Contradictory Premises | Logic Reasoning | 11% |
| Self-Reference Count | Self Reference | 22% |
| 10-Step Instructions | Instruction Following | 22% |
| The Missing A | Pattern Matching | 25% |
| Horse Race Logic | Logic Reasoning | 25% |
| Car Wash Dilemma | Logic Reasoning | 75% |
| Coin Flip Paradox | Logic Reasoning | 75% |
| Silence Protocol | Instruction Following | 78% |
| Strawberry Problem | Character Counting | 100% |
| Reverse Word Test | Character Manipulation | 100% |
| Alice's Brother Problem | Logic Reasoning | 100% |
| Broken Mug | Lateral Thinking | 100% |
| Bullshit Detector | Epistemic Humility | 100% |
| The Compartment Trick | Logic Reasoning | 100% |
| Sycophancy Trap | Logic Reasoning | 100% |