Anthropic: Claude Sonnet 4

Survived 6 out of 15 breakers

Resilience
40%

Claude Sonnet 4 improves on its predecessor, Sonnet 3.7, in both coding and reasoning tasks, with better precision and controllability. It achieves state-of-the-art performance on SWE-bench (72.7%) while balancing capability against computational efficiency, which makes it suitable for work ranging from routine coding tasks to complex software development projects. Key enhancements include improved autonomous codebase navigation, reduced error rates in agent-driven workflows, and more reliable adherence to intricate instructions. Sonnet 4 is optimized for practical everyday use: it provides advanced reasoning while staying efficient and responsive in both internal and external use cases. Read more in the [blog post here](https://www.anthropic.com/news/claude-4).

Context

200,000 tokens

Cost (Input)

$3.00 /1M tokens

Cost (Output)

$15.00 /1M tokens

Max completion tokens

64,000
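
The pricing and limit figures above can be combined into a rough per-request cost estimate. The sketch below is illustrative only: the function name `estimate_cost` is hypothetical, and it assumes that input and output tokens together must fit within the 200,000-token context window, which this page does not state.

```python
# Figures copied from the listing above.
INPUT_PRICE_PER_MTOK = 3.00      # USD per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 15.00    # USD per 1M output tokens
CONTEXT_WINDOW = 200_000         # tokens
MAX_COMPLETION_TOKENS = 64_000   # tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > MAX_COMPLETION_TOKENS:
        raise ValueError("output exceeds the max completion tokens")
    # Assumption: prompt and completion share the same context window.
    if input_tokens + output_tokens > CONTEXT_WINDOW:
        raise ValueError("request exceeds the context window")
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


if __name__ == "__main__":
    # 10,000 input tokens and 2,000 output tokens.
    print(f"${estimate_cost(10_000, 2_000):.4f}")
```

Output tokens cost five times as much as input tokens at these rates, so a long completion dominates the bill even when the prompt is much larger.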

Toughest Breakers

Breaker Results

| Test | Category | Latest Result | Success Rate |
| --- | --- | --- | --- |
| Self-Reference Count | Self Reference | | 0% |
| Silence Protocol | Instruction Following | | 0% |
| Contradictory Premises | Logic Reasoning | | 0% |
| Broken Mug | Lateral Thinking | | 0% |
| Car Wash Dilemma | Logic Reasoning | | 0% |
| The Missing A | Pattern Matching | | 0% |
| 10-Step Instructions | Instruction Following | | 11% |
| Horse Race Logic | Logic Reasoning | | 50% |
| Bullshit Detector | Epistemic Humility | | 75% |
| Strawberry Problem | Character Counting | | 100% |
| Reverse Word Test | Character Manipulation | | 100% |
| Alice's Brother Problem | Logic Reasoning | | 100% |
| The Compartment Trick | Logic Reasoning | | 100% |
| Sycophancy Trap | Logic Reasoning | | 100% |
| Coin Flip Paradox | Logic Reasoning | | 100% |
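
The headline figures (6 of 15 breakers survived, 40% resilience) appear to follow from the per-test success rates. The sketch below reproduces them under one assumption that the page does not confirm: a breaker counts as "survived" only when its success rate is 100%.

```python
# Success rates (in percent) copied from the Breaker Results table.
success_rates = {
    "Self-Reference Count": 0,
    "Silence Protocol": 0,
    "Contradictory Premises": 0,
    "Broken Mug": 0,
    "Car Wash Dilemma": 0,
    "The Missing A": 0,
    "10-Step Instructions": 11,
    "Horse Race Logic": 50,
    "Bullshit Detector": 75,
    "Strawberry Problem": 100,
    "Reverse Word Test": 100,
    "Alice's Brother Problem": 100,
    "The Compartment Trick": 100,
    "Sycophancy Trap": 100,
    "Coin Flip Paradox": 100,
}

# Assumption: "survived" means a 100% success rate on that breaker.
survived = [name for name, rate in success_rates.items() if rate == 100]
resilience = len(survived) / len(success_rates)

print(f"{len(survived)} of {len(success_rates)} survived, resilience {resilience:.0%}")
# prints "6 of 15 survived, resilience 40%"
```

Under this reading, partial scores such as 75% on Bullshit Detector do not contribute to resilience at all.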