ReAIity Check
Anthropic
13 models tracked
Average resilience: 63%
Tests survived: 1286
Tests failed: 702
Toughest Breakers
#1 10-Step Instructions (Instruction Following), pass rate (provider): 8%
#2 Contradictory Premises (Logic Reasoning), pass rate (provider): 8%
#3 Car Wash Dilemma (Logic Reasoning), pass rate (provider): 8%
Models
#1 Anthropic: Claude Opus 4.5, survived 81%, failure rate 19%
#2 Anthropic: Claude Opus 4.6, survived 80%, failure rate 20%
#3 Anthropic: Claude 3.7 Sonnet (thinking), survived 77%, failure rate 23%
#4 Anthropic: Claude Sonnet 4.6, survived 74%, failure rate 26%
#5 Anthropic: Claude Opus 4, survived 64%, failure rate 36%
#6 Anthropic: Claude Sonnet 4.5, survived 64%, failure rate 36%
#7 Anthropic: Claude Opus 4.1, survived 63%, failure rate 37%
#8 Anthropic: Claude Haiku 4.5, survived 62%, failure rate 38%
#9 Anthropic: Claude 3.7 Sonnet, survived 59%, failure rate 41%
#10 Anthropic: Claude Sonnet 4, survived 59%, failure rate 41%
#11 Anthropic: Claude 3.5 Sonnet, survived 53%, failure rate 47%
#12 Anthropic: Claude 3.5 Haiku, survived 47%, failure rate 53%
#13 Anthropic: Claude 3 Haiku, survived 39%, failure rate 61%
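
The page does not say how "Average resilience" is computed. As a rough check, and only as an assumption, the sketch below compares two plausible readings using the figures listed above: the unweighted mean of the 13 per-model "Survived" percentages, and the pooled rate from the "Tests Survived" and "Tests Failed" totals. The variable names are illustrative and not taken from the site.

```python
# Per-model "Survived" percentages, copied from the Models list on this page.
survived = {
    "Claude Opus 4.5": 81,
    "Claude Opus 4.6": 80,
    "Claude 3.7 Sonnet (thinking)": 77,
    "Claude Sonnet 4.6": 74,
    "Claude Opus 4": 64,
    "Claude Sonnet 4.5": 64,
    "Claude Opus 4.1": 63,
    "Claude Haiku 4.5": 62,
    "Claude 3.7 Sonnet": 59,
    "Claude Sonnet 4": 59,
    "Claude 3.5 Sonnet": 53,
    "Claude 3.5 Haiku": 47,
    "Claude 3 Haiku": 39,
}

# Reading 1 (assumption): unweighted mean of the per-model survival rates.
mean_rate = sum(survived.values()) / len(survived)

# Reading 2 (assumption): pooled rate over all tests, from the page totals.
tests_survived, tests_failed = 1286, 702
pooled_rate = 100 * tests_survived / (tests_survived + tests_failed)

print(f"unweighted mean of model rates: {mean_rate:.1f}%")
print(f"pooled rate over all tests:     {pooled_rate:.1f}%")
```

Under these assumptions, the unweighted mean rounds to the 63% shown for the provider, while the pooled rate comes out slightly higher. That is consistent with the headline being a per-model average, though the page itself does not confirm it.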