Models Challenges Benchmarks About Submit Challenge

Anthropic: Claude Opus 4.5

Survived 10 out of 15 breakers

Resilience

67%

Claude Opus 4.5 is Anthropic’s frontier reasoning model optimized for complex software engineering, agentic workflows, and long-horizon computer use. It offers strong multimodal capabilities, competitive performance across real-world coding and reasoning benchmarks, and improved robustness to prompt injection. The model is designed to operate efficiently across varied effort levels, enabling developers to trade off speed, depth, and token usage depending on task requirements. It comes with a new parameter to control token efficiency, which can be accessed using the OpenRouter Verbosity parameter with low, medium, or high. Opus 4.5 supports advanced tool use, extended context management, and coordinated multi-agent setups, making it well-suited for autonomous research, debugging, multi-step planning, and spreadsheet/browser manipulation. It delivers substantial gains in structured reasoning, execution reliability, and alignment compared to prior Opus generations, while reducing token overhead and improving performance on long-running tasks.

Context

200,000 tokens

Cost (Input)

$5.00 /1M tokens

Cost (Output)

$25.00 /1M tokens

Max completion tokens

64,000

Toughest Breakers

Car Wash Dilemma

Logic Reasoning

Pass rate

The Missing A

Pattern Matching

Pass rate

Self-Reference Count

Self Reference

Pass rate

Breaker Results

Test	Category	Success Rate
Car Wash Dilemma	Logic Reasoning	0%
The Missing A	Pattern Matching	0%
Self-Reference Count	Self Reference	7%
Contradictory Premises	Logic Reasoning	67%
10-Step Instructions	Instruction Following	72%
Strawberry Problem	Character Counting	100%
Reverse Word Test	Character Manipulation	100%
Alice's Brother Problem	Logic Reasoning	100%
Silence Protocol	Instruction Following	100%
Broken Mug	Lateral Thinking	100%
Bullshit Detector	Epistemic Humility	100%
Horse Race Logic	Logic Reasoning	100%
The Compartment Trick	Logic Reasoning	100%
Sycophancy Trap	Logic Reasoning	100%
Coin Flip Paradox	Logic Reasoning	100%