Anthropic: Claude Opus 4.1

Survived 8 out of 15 breakers

Resilience: 53%

Claude Opus 4.1 is an updated version of Anthropic’s flagship model, offering improved performance in coding, reasoning, and agentic tasks. It achieves 74.5% on SWE-bench Verified and shows notable gains in multi-file code refactoring, debugging precision, and detail-oriented reasoning. The model supports extended thinking up to 64K tokens and is optimized for tasks involving research, data analysis, and tool-assisted reasoning.

Context: 200,000 tokens
Cost (Input): $15.00 / 1M tokens
Cost (Output): $75.00 / 1M tokens
Max completion tokens: 32,000
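As a rough illustration of how the listed prices translate into a per-request cost, here is a minimal sketch. The function name `estimate_cost_usd` and the way the limits are enforced are assumptions made for this example, not part of any official SDK; only the prices and limits come from the listing above.

```python
# Figures taken from the listing above.
INPUT_USD_PER_MTOK = 15.00      # dollars per 1M input tokens
OUTPUT_USD_PER_MTOK = 75.00     # dollars per 1M output tokens
CONTEXT_WINDOW_TOKENS = 200_000
MAX_COMPLETION_TOKENS = 32_000


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in US dollars (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Assumption: the input must fit in the context window and the output
    # must fit in the max-completion limit; real billing rules may differ.
    if input_tokens > CONTEXT_WINDOW_TOKENS:
        raise ValueError("input exceeds the listed context window")
    if output_tokens > MAX_COMPLETION_TOKENS:
        raise ValueError("output exceeds the listed max completion tokens")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # 10,000 input tokens and 2,000 output tokens.
    print(f"${estimate_cost_usd(10_000, 2_000):.2f}")
```

Because output tokens are priced at five times the input rate, a 2,000-token reply costs as much as a 10,000-token prompt in this sketch.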

Breaker Results

Test                     Category                Success Rate
Self-Reference Count     Self Reference          0%
Silence Protocol         Instruction Following   0%
Contradictory Premises   Logic Reasoning         0%
Broken Mug               Lateral Thinking        0%
Car Wash Dilemma         Logic Reasoning         0%
10-Step Instructions     Instruction Following   22%
Bullshit Detector        Epistemic Humility      50%
Horse Race Logic         Logic Reasoning         75%
Strawberry Problem       Character Counting      100%
Reverse Word Test        Character Manipulation  100%
Alice's Brother Problem  Logic Reasoning         100%
The Missing A            Pattern Matching        100%
The Compartment Trick    Logic Reasoning         100%
Sycophancy Trap          Logic Reasoning         100%
Coin Flip Paradox        Logic Reasoning         100%