Anthropic: Claude Opus 4.1

Survived 8 out of 15 breakers

Resilience

53%

Claude Opus 4.1 is an updated version of Anthropic’s flagship model, offering improved performance in coding, reasoning, and agentic tasks. It achieves 74.5% on SWE-bench Verified and shows notable gains in multi-file code refactoring, debugging precision, and detail-oriented reasoning. The model supports extended thinking up to 64K tokens and is optimized for tasks involving research, data analysis, and tool-assisted reasoning.

Context

200,000 tokens

Cost (Input)

$15.00 /1M tokens

Cost (Output)

$75.00 /1M tokens

Max completion tokens

32,000
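The per-token prices above translate into a per-request cost with simple arithmetic. The sketch below is illustrative only: the function name `estimate_cost_usd` and the constant names are hypothetical, and the only figures taken from this page are the two prices and the completion cap.

```python
# Figures copied from the listing above.
INPUT_USD_PER_MILLION = 15.00    # Cost (Input), USD per 1M tokens
OUTPUT_USD_PER_MILLION = 75.00   # Cost (Output), USD per 1M tokens
MAX_COMPLETION_TOKENS = 32_000   # Max completion tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > MAX_COMPLETION_TOKENS:
        raise ValueError("output exceeds the listed completion cap")
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


if __name__ == "__main__":
    # 10,000 input tokens and 2,000 output tokens cost $0.15 each.
    print(f"${estimate_cost_usd(10_000, 2_000):.2f}")
```

At these rates an output token costs five times as much as an input token, so long completions tend to dominate the bill.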

Toughest Breakers

Self-Reference Count

Self Reference

Pass rate

0%

Silence Protocol

Instruction Following

Pass rate

0%

Contradictory Premises

Logic Reasoning

Pass rate

0%

Breaker Results

Test	Category	Success Rate
Self-Reference Count	Self Reference	0%
Silence Protocol	Instruction Following	0%
Contradictory Premises	Logic Reasoning	0%
Broken Mug	Lateral Thinking	0%
Car Wash Dilemma	Logic Reasoning	0%
10-Step Instructions	Instruction Following	22%
Bullshit Detector	Epistemic Humility	50%
Horse Race Logic	Logic Reasoning	75%
Strawberry Problem	Character Counting	100%
Reverse Word Test	Character Manipulation	100%
Alice's Brother Problem	Logic Reasoning	100%
The Missing A	Pattern Matching	100%
The Compartment Trick	Logic Reasoning	100%
Sycophancy Trap	Logic Reasoning	100%
Coin Flip Paradox	Logic Reasoning	100%
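The headline figures (8 of 15 breakers survived, 53% resilience) can be reproduced from the table with a short sketch. The page does not say what counts as "survived"; the cutoff used here, a success rate strictly above 50%, is an assumption that happens to match the reported count. The names `SURVIVAL_THRESHOLD`, `survived_breakers`, and `resilience_percent` are hypothetical.

```python
# (test, category, success rate in percent), copied from the table above.
RESULTS = [
    ("Self-Reference Count", "Self Reference", 0),
    ("Silence Protocol", "Instruction Following", 0),
    ("Contradictory Premises", "Logic Reasoning", 0),
    ("Broken Mug", "Lateral Thinking", 0),
    ("Car Wash Dilemma", "Logic Reasoning", 0),
    ("10-Step Instructions", "Instruction Following", 22),
    ("Bullshit Detector", "Epistemic Humility", 50),
    ("Horse Race Logic", "Logic Reasoning", 75),
    ("Strawberry Problem", "Character Counting", 100),
    ("Reverse Word Test", "Character Manipulation", 100),
    ("Alice's Brother Problem", "Logic Reasoning", 100),
    ("The Missing A", "Pattern Matching", 100),
    ("The Compartment Trick", "Logic Reasoning", 100),
    ("Sycophancy Trap", "Logic Reasoning", 100),
    ("Coin Flip Paradox", "Logic Reasoning", 100),
]

# ASSUMPTION: a breaker is "survived" when its success rate exceeds 50%.
SURVIVAL_THRESHOLD = 50


def survived_breakers(results, threshold=SURVIVAL_THRESHOLD):
    """Return the names of tests whose success rate is above the threshold."""
    return [name for name, _category, rate in results if rate > threshold]


def resilience_percent(results, threshold=SURVIVAL_THRESHOLD):
    """Share of breakers survived, rounded to a whole percent."""
    return round(100 * len(survived_breakers(results, threshold)) / len(results))


if __name__ == "__main__":
    print(len(survived_breakers(RESULTS)), "of", len(RESULTS))
    print(resilience_percent(RESULTS), "%")
```

Under this assumed cutoff, Horse Race Logic (75%) counts as survived and Bullshit Detector (50%) does not. Any threshold between those two values would give the same totals.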