Self-Reference Count
Self Reference
Survived 6 out of 15 breakers
New Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at: - Coding: Scores ~49% on SWE-Bench Verified, higher than the last best score, and without any fancy prompt scaffolding - Data science: Augments human data science expertise; navigates unstructured data while using multiple tools for insights - Visual processing: excelling at interpreting charts, graphs, and images, accurately transcribing text to derive insights beyond just the text alone - Agentic tasks: exceptional tool use, making it great at agentic tasks (i.e. complex, multi-step problem solving tasks that require engaging with other systems) #multimodal
200,000 tokens
$6.00 /1M tokens
$30.00 /1M tokens
8,192
| Test | Category | Latest Result | Success Rate | |
|---|---|---|---|---|
| Self-Reference Count | Self Reference | 0% | ||
| Silence Protocol | Instruction Following | 0% | ||
| Contradictory Premises | Logic Reasoning | 0% | ||
| Broken Mug | Lateral Thinking | 0% | ||
| Car Wash Dilemma | Logic Reasoning | 0% | ||
| The Missing A | Pattern Matching | 0% | ||
| Strawberry Problem | Character Counting | 11% | ||
| 10-Step Instructions | Instruction Following | 11% | ||
| Horse Race Logic | Logic Reasoning | 25% | ||
| Reverse Word Test | Character Manipulation | 100% | ||
| Alice's Brother Problem | Logic Reasoning | 100% | ||
| Bullshit Detector | Epistemic Humility | 100% | ||
| The Compartment Trick | Logic Reasoning | 100% | ||
| Sycophancy Trap | Logic Reasoning | 100% | ||
| Coin Flip Paradox | Logic Reasoning | 100% |