OpenAI: gpt-oss-120b

Survived 8 out of 15 breakers

Resilience: 53%

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized to run on a single H100 GPU with native MXFP4 quantization. The model supports configurable reasoning depth, full chain-of-thought access, and native tool use, including function calling, browsing, and structured output generation.

Context: 131,072 tokens
Cost (Input): $0.04 / 1M tokens
Cost (Output): $0.19 / 1M tokens
Max completion tokens: –
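As a rough illustration of how the per-token prices above translate into the cost of a single request, here is a minimal sketch. The function name `estimate_cost` and the example token counts are hypothetical; only the two prices come from this page, and actual billing may differ by provider.

```python
# Prices listed on this page, in US dollars per one million tokens.
INPUT_PRICE_PER_M = 0.04
OUTPUT_PRICE_PER_M = 0.19


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request (hypothetical helper)."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# Example token counts are made up for illustration.
cost = estimate_cost(input_tokens=100_000, output_tokens=2_000)
print(f"Estimated cost: ${cost:.5f}")
```

With these example numbers, the input side dominates (about $0.004 of the total) even though output tokens cost almost five times as much per token.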

Toughest Breakers

Test	Category	Pass rate
Car Wash Dilemma	Logic Reasoning	0%
The Missing A	Pattern Matching	0%
Bullshit Detector	Epistemic Humility	0%

Breaker Results

Test	Category	Success Rate
Car Wash Dilemma	Logic Reasoning	0%
The Missing A	Pattern Matching	0%
Bullshit Detector	Epistemic Humility	0%
Self-Reference Count	Self Reference	9%
10-Step Instructions	Instruction Following	9%
Contradictory Premises	Logic Reasoning	18%
Coin Flip Paradox	Logic Reasoning	25%
Horse Race Logic	Logic Reasoning	50%
Silence Protocol	Instruction Following	82%
Strawberry Problem	Character Counting	100%
Reverse Word Test	Character Manipulation	100%
Alice's Brother Problem	Logic Reasoning	100%
Broken Mug	Lateral Thinking	100%
The Compartment Trick	Logic Reasoning	100%
Sycophancy Trap	Logic Reasoning	100%
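
The headline figures (8 of 15 breakers survived, 53% resilience) can be reproduced from the table with a short sketch. The page does not say what counts as "survived"; the code assumes a success rate of 50% or higher, which is a guess that happens to match the reported count. The names `SURVIVAL_THRESHOLD` and `resilience` are hypothetical.

```python
# (test name, success rate in percent), copied from the Breaker Results table.
results = [
    ("Car Wash Dilemma", 0),
    ("The Missing A", 0),
    ("Bullshit Detector", 0),
    ("Self-Reference Count", 9),
    ("10-Step Instructions", 9),
    ("Contradictory Premises", 18),
    ("Coin Flip Paradox", 25),
    ("Horse Race Logic", 50),
    ("Silence Protocol", 82),
    ("Strawberry Problem", 100),
    ("Reverse Word Test", 100),
    ("Alice's Brother Problem", 100),
    ("Broken Mug", 100),
    ("The Compartment Trick", 100),
    ("Sycophancy Trap", 100),
]

# Assumption: a breaker is "survived" at a success rate of 50% or higher.
# The page does not state its actual rule.
SURVIVAL_THRESHOLD = 50


def resilience(rows, threshold=SURVIVAL_THRESHOLD):
    """Return (survived, total, resilience percent rounded to an integer)."""
    survived = sum(1 for _, rate in rows if rate >= threshold)
    total = len(rows)
    return survived, total, round(100 * survived / total)


survived, total, pct = resilience(results)
print(f"Survived {survived} out of {total} breakers ({pct}%)")
```

Under a strictly-greater-than-50% rule, Horse Race Logic would drop out and the count would be 7, so the inclusive threshold is the reading consistent with the page's summary.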