See Where AI Actually Fails
AI benchmarks tell you where models excel. We show you where they consistently fail — with daily automated testing and full transparency.
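GPT-4o: 78% (7/9 passed)
Claude Opus 4.5: 78% (7/9 passed)
Claude Sonnet 4.5: 56% (5/9 passed)
GPT-4o-mini: 44% (4/9 passed)
Gemini 2.0 Flash: 33% (3/9 passed)
Llama 3.3 70B: 22% (2/9 passed)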
| Test | GPT-4o | GPT-4o-mini | Claude Opus 4.5 | Claude Sonnet 4.5 | Gemini 2.0 Flash | Llama 3.3 70B |
|---|---|---|---|---|---|---|
| Strawberry Problem | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ |
| Fabricated Citations | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Alice's Brother Problem | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| Self-Reference Count | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| I Don't Know Test | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| Reverse Word Test | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| 10-Step Instructions | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Minute Moment Riddle | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multi-Step Arithmetic | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
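The pass rates follow directly from the table. As a minimal sketch, the snippet below transcribes each column as a list of 1 (pass) and 0 (fail) in table row order and recomputes the score; the names `RESULTS` and `pass_rate` are illustrative and not part of the actual test harness.

```python
# Table columns transcribed by hand: 1 = pass, 0 = fail, in table row order.
RESULTS = {
    "GPT-4o":            [1, 0, 1, 0, 1, 1, 1, 1, 1],
    "GPT-4o-mini":       [0, 0, 0, 0, 1, 0, 1, 1, 1],
    "Claude Opus 4.5":   [1, 1, 1, 0, 1, 1, 0, 1, 1],
    "Claude Sonnet 4.5": [1, 0, 1, 0, 1, 0, 0, 1, 1],
    "Gemini 2.0 Flash":  [0, 0, 1, 0, 0, 0, 0, 1, 1],
    "Llama 3.3 70B":     [0, 0, 0, 0, 0, 0, 0, 1, 1],
}

def pass_rate(outcomes):
    """Return (passed, total, percentage rounded to the nearest integer)."""
    passed = sum(outcomes)
    total = len(outcomes)
    return passed, total, round(100 * passed / total)

for model, outcomes in RESULTS.items():
    passed, total, pct = pass_rate(outcomes)
    print(f"{model}: {pct}% ({passed}/{total} passed)")
```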
How many times does the letter 'r' appear in the word 'strawberry'?
Tests character-level understanding. Models often fail because they process text as subword tokens rather than individual characters, so the letters of 'strawberry' are not directly visible to them.
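The expected answer is trivial to compute when working on characters directly. A minimal sketch (the helper `count_letter` is a hypothetical name used only for illustration):

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

expected = count_letter("strawberry", "r")
print(expected)  # 3
```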
Find me 3 academic papers about the psychological effects of telepathic communication with dolphins published between 2020-2023.
Tests whether models admit a lack of knowledge rather than fabricate plausible-sounding sources.
Alice has 3 sisters and 1 brother. How many sisters does Alice's brother have?
Tests perspective-taking and logic. The correct answer is 4 (Alice + her 3 sisters), not 3.
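The shift in perspective can be made explicit by modelling the family and counting sisters from each sibling's point of view. This is a sketch that assumes Alice is female; the sibling names other than Alice are placeholders.

```python
# Hypothetical family matching the prompt; names other than Alice are placeholders.
FAMILY = [
    ("Alice", "F"),
    ("Sister 1", "F"),
    ("Sister 2", "F"),
    ("Sister 3", "F"),
    ("Brother", "M"),
]

def count_sisters(person: str) -> int:
    """Count female siblings of `person`, excluding the person themselves."""
    return sum(1 for name, sex in FAMILY if sex == "F" and name != person)

print(count_sisters("Alice"))    # 3
print(count_sisters("Brother"))  # 4
```

The common wrong answer, 3, comes from counting from Alice's perspective instead of her brother's.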
How many letters are in your answer to this question?
Tests self-referential reasoning: the model must count the letters in its own response, so a valid answer has to describe itself correctly (for example, "four").
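One way such an answer could be checked is to compare the number the response claims with the number of letters it actually contains. The sketch below is an assumption about how a validator might work, not the one used in production; it only understands digits and the number words zero through twenty.

```python
import re

# Number words the sketch understands; anything beyond twenty is out of scope.
NUMBER_WORDS = {
    word: value
    for value, word in enumerate(
        "zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty".split()
    )
}

def count_letters(text: str) -> int:
    """Count alphabetic characters only (digits, spaces, punctuation ignored)."""
    return sum(ch.isalpha() for ch in text)

def claimed_number(text: str):
    """Return the first number stated in the text, as digits or a number word."""
    digits = re.search(r"\d+", text)
    if digits:
        return int(digits.group())
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in NUMBER_WORDS:
            return NUMBER_WORDS[token]
    return None

def is_self_consistent(answer: str) -> bool:
    """True if the stated number equals the answer's own letter count."""
    claimed = claimed_number(answer)
    return claimed is not None and claimed == count_letters(answer)

print(is_self_consistent("four"))  # True
print(is_self_consistent("five"))  # False
```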
What was the color of my grandfather's favorite shirt?
Tests whether models honestly admit lack of knowledge instead of fabricating plausible answers.
Write the word 'artificial' backwards.
Tests character-level manipulation. A simple task that many models fail; the correct answer is 'laicifitra'.
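For reference, reversing a string is a one-line slice in Python. The check below is a sketch that accepts any response containing the reversed word; the function names are illustrative.

```python
def reverse_word(word: str) -> str:
    """Reverse a string character by character."""
    return word[::-1]

def contains_reversal(answer: str, word: str) -> bool:
    """True if the reversed word appears anywhere in the answer (case-insensitive)."""
    return reverse_word(word).lower() in answer.lower()

print(reverse_word("artificial"))  # laicifitra
```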
Please do the following in your response: 1. Start with the word "Hello" 2. Include exactly 3 numbers 3. Use the word "elephant" twice 4. End with a question mark 5. Make it exactly 4 sentences 6. Include one emoji 7. Mention a color 8. Use alliteration in the first sentence 9. Include a movie reference 10. Make the last word "goodbye"
Tests ability to follow multiple detailed instructions simultaneously.
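Several of the ten instructions can be checked mechanically. The sketch below covers only six of them (1 to 5 and 10) and uses deliberately naive rules, such as splitting sentences on `.`, `!`, and `?`. The emoji, color, alliteration, and movie-reference checks are left out because they need more than a regular expression. The function name and rules are assumptions, not the benchmark's actual validator.

```python
import re

def check_instructions(text: str) -> dict:
    """Run naive checks for instructions 1-5 and 10; returns one bool per check."""
    stripped = text.strip()
    words = re.findall(r"[A-Za-z']+", stripped)
    numbers = re.findall(r"\d+(?:\.\d+)?", stripped)
    # Naive sentence split: breaks on abbreviations and decimals followed by text.
    sentences = [s for s in re.split(r"[.!?]+", stripped) if s.strip()]
    elephants = re.findall(r"\belephant\b", stripped, re.IGNORECASE)
    return {
        "starts_with_hello": stripped.startswith("Hello"),
        "exactly_3_numbers": len(numbers) == 3,
        "elephant_twice": len(elephants) == 2,
        "ends_with_question_mark": stripped.endswith("?"),
        "exactly_4_sentences": len(sentences) == 4,
        "last_word_goodbye": bool(words) and words[-1].lower() == "goodbye",
    }

# Made-up response used only to exercise the checks.
sample = (
    "Hello, happy humans! The red elephant from Dumbo met 3 friends and 1 elephant. "
    "They walked 7 miles. Shall we say goodbye?"
)
print(check_instructions(sample))
```

Note that instructions 4 and 10 are compatible: the response can end with "goodbye?".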
What can be seen once in a minute, twice in a moment, but never in a thousand years?
Classic riddle testing lateral thinking and pattern recognition. The answer is the letter 'm'.
I had 5 apples. I ate 2. I bought 3 more. I gave half to a friend. How many apples do I have now?
Tests multi-step reasoning and arithmetic. The steps are simple, but models sometimes lose track. The correct answer is 3.
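The running total in the prompt can be traced one step at a time. A short sketch (the function name is illustrative):

```python
def apples_remaining() -> int:
    """Follow the prompt's four steps in order and return the final count."""
    apples = 5    # I had 5 apples.
    apples -= 2   # I ate 2.            -> 3
    apples += 3   # I bought 3 more.    -> 6
    apples //= 2  # I gave half away.   -> 3
    return apples

print(apples_remaining())  # 3
```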
We collect well-known AI failure cases — the ones that go viral on social media and highlight real limitations.
Every day at 3 AM UTC, each model receives identical prompts under controlled conditions. No cherry-picking.
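On a host whose clock is set to UTC, a daily 3 AM run corresponds to the cron expression `0 3 * * *`. As an illustration of the same schedule in code, the sketch below computes the next run time from a timezone-aware timestamp; `next_run` is a hypothetical helper, not part of the real scheduler.

```python
from datetime import datetime, time, timedelta, timezone

def next_run(now: datetime) -> datetime:
    """Return the next 03:00 UTC strictly after `now` (must be timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), time(3, 0), tzinfo=timezone.utc)
    if candidate <= now_utc:
        candidate += timedelta(days=1)
    return candidate

print(next_run(datetime.now(timezone.utc)))
```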
Answers are validated against expected results using exact match, pattern matching, and custom validators.
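The three validation strategies could look roughly like the sketch below. All names, patterns, and refusal phrases here are assumptions chosen for illustration; the real validators may be stricter or more lenient.

```python
import re
from typing import Callable

Validator = Callable[[str], bool]

def exact_match(expected: str) -> Validator:
    """Pass only if the trimmed, lowercased answer equals the expected string."""
    target = expected.strip().lower()
    return lambda answer: answer.strip().lower() == target

def pattern_match(pattern: str) -> Validator:
    """Pass if the regular expression matches anywhere in the answer."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return lambda answer: compiled.search(answer) is not None

def admits_ignorance(answer: str) -> bool:
    """Custom validator: pass if the answer contains a refusal-style phrase."""
    phrases = ("don't know", "do not know", "no way to know", "cannot know", "can't know")
    lowered = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(phrase in lowered for phrase in phrases)

# Hypothetical mapping from test name to validator.
VALIDATORS = {
    "Reverse Word Test": exact_match("laicifitra"),
    "Strawberry Problem": pattern_match(r"\b(3|three)\b"),
    "I Don't Know Test": admits_ignorance,
}

print(VALIDATORS["Strawberry Problem"]("There are three r's."))  # True
```

Exact match is the strictest option and suits single-token answers; pattern matching tolerates surrounding prose; custom validators handle cases where no fixed string is correct.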
We store every result to monitor whether models improve or regress on specific failure modes.
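With every result stored, spotting a regression reduces to comparing the two most recent runs for each model and test. The sketch below assumes a simple record shape (`Result`) and uses made-up sample data purely to exercise the logic; none of it reflects actual stored results.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Result:
    day: date
    model: str
    test: str
    passed: bool

def find_changes(history):
    """Map (model, test) to 'regressed' or 'improved' based on the last two runs."""
    by_key = defaultdict(list)
    for result in history:
        by_key[(result.model, result.test)].append(result)

    changes = {}
    for key, runs in by_key.items():
        runs.sort(key=lambda r: r.day)
        if len(runs) < 2:
            continue  # nothing to compare against yet
        previous, latest = runs[-2], runs[-1]
        if previous.passed and not latest.passed:
            changes[key] = "regressed"
        elif not previous.passed and latest.passed:
            changes[key] = "improved"
    return changes

# Made-up history for illustration only; not real benchmark data.
history = [
    Result(date(2025, 1, 1), "model-a", "Strawberry Problem", True),
    Result(date(2025, 1, 2), "model-a", "Strawberry Problem", False),
    Result(date(2025, 1, 1), "model-b", "Reverse Word Test", False),
    Result(date(2025, 1, 2), "model-b", "Reverse Word Test", True),
    Result(date(2025, 1, 2), "model-b", "Strawberry Problem", True),
]
print(find_changes(history))
```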