Anthropic: Claude Opus 4.5

Survived 10 out of 15 breakers

Resilience
67%

Claude Opus 4.5 is Anthropic’s frontier reasoning model optimized for complex software engineering, agentic workflows, and long-horizon computer use. It offers strong multimodal capabilities, competitive performance across real-world coding and reasoning benchmarks, and improved robustness to prompt injection. The model is designed to operate efficiently across varied effort levels, enabling developers to trade off speed, depth, and token usage depending on task requirements. It comes with a new parameter to control token efficiency, which can be accessed using the OpenRouter Verbosity parameter with low, medium, or high. Opus 4.5 supports advanced tool use, extended context management, and coordinated multi-agent setups, making it well-suited for autonomous research, debugging, multi-step planning, and spreadsheet/browser manipulation. It delivers substantial gains in structured reasoning, execution reliability, and alignment compared to prior Opus generations, while reducing token overhead and improving performance on long-running tasks.

Context

200,000 tokens

Cost (Input)

$5.00 /1M tokens

Cost (Output)

$25.00 /1M tokens

Max completion tokens

64,000

Toughest Breakers

Breaker Results

TestCategoryLatest ResultSuccess Rate
Car Wash DilemmaLogic Reasoning0%
The Missing APattern Matching0%
Self-Reference CountSelf Reference7%
Contradictory PremisesLogic Reasoning67%
10-Step InstructionsInstruction Following72%
Strawberry ProblemCharacter Counting100%
Reverse Word TestCharacter Manipulation100%
Alice's Brother ProblemLogic Reasoning100%
Silence ProtocolInstruction Following100%
Broken MugLateral Thinking100%
Bullshit DetectorEpistemic Humility100%
Horse Race LogicLogic Reasoning100%
The Compartment TrickLogic Reasoning100%
Sycophancy TrapLogic Reasoning100%
Coin Flip ParadoxLogic Reasoning100%