ReAIity Check

See Where AI Actually Fails

AI benchmarks tell you where models excel. We show you where they consistently fail — with daily automated testing and full transparency.

6 Models, 9 Tests
Last updated about 22 hours ago

Leaderboard

#1
GPT-4o
openai

78%

7/9 passed

Improving
#2
Claude Opus 4.5
anthropic

78%

7/9 passed

Improving
#3
Claude Sonnet 4.5
anthropic

56%

5/9 passed

Stable
GPT-4o-mini
openai

44%

4/9 passed

Stable
Gemini 2.0 Flash
google

33%

3/9 passed

Stable
Llama 3.3 70B
meta

22%

2/9 passed

Declining

The Reality Check

Test | GPT-4o | GPT-4o-mini | Claude Opus 4.5 | Claude Sonnet 4.5 | Gemini 2.0 Flash | Llama 3.3 70B
Strawberry Problem
Fabricated Citations
Alice's Brother Problem
Self-Reference Count
I Don't Know Test
Reverse Word Test
10-Step Instructions
Minute Moment Riddle
Multi-Step Arithmetic

The Tests

Strawberry Problem
Character Counting (easy)

How many times does the letter 'r' appear in the word 'strawberry'?

Tests character-level understanding. Models often fail because they process text as subword tokens rather than individual characters, so the letters inside 'strawberry' are not directly visible to them.
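The expected answer for this test is easy to compute in code, which is what makes the failure notable. A minimal sketch in Python; the helper name `count_letter` is illustrative and not taken from the site's test harness.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


expected = count_letter("strawberry", "r")
print(expected)  # 3
```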

Fabricated Citations
Hallucinations (hard)

Find me 3 academic papers about the psychological effects of telepathic communication with dolphins published between 2020-2023.

Tests whether models admit lack of knowledge vs. fabricating plausible-sounding sources.

Alice's Brother Problem
Logic Reasoning (medium)

Alice has 3 sisters and 1 brother. How many sisters does Alice's brother have?

Tests perspective-taking and logic. The correct answer is 4 (Alice + her 3 sisters), not 3.
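The perspective shift can be written out explicitly. The sketch below assumes, as the puzzle implies, that Alice is female; the function name and parameters are hypothetical.

```python
def sisters_of_brother(alice_sisters: int, alice_is_female: bool = True) -> int:
    """Sisters from the brother's point of view: Alice's sisters plus Alice herself."""
    return alice_sisters + (1 if alice_is_female else 0)


print(sisters_of_brother(3))  # 4
```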

Self-Reference Count
Self Reference (hard)

How many letters are in your answer to this question?

Tests self-awareness and recursive reasoning. The model must count the letters in its own response, so the number it states has to match the response that contains it.
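One possible way to check a response for self-consistency is to compare the number it claims with its actual letter count. This is only a sketch under assumptions: the site's real validator is not described, and `NUMBER_WORDS`, `claimed_count`, and `is_self_consistent` are hypothetical names.

```python
import re
from typing import Optional

# Hypothetical lookup for small spelled-out numbers.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def claimed_count(response: str) -> Optional[int]:
    """Return the first number stated in the response, as digits or as a word."""
    digits = re.search(r"\d+", response)
    if digits:
        return int(digits.group())
    for word in re.findall(r"[a-z]+", response.lower()):
        if word in NUMBER_WORDS:
            return NUMBER_WORDS[word]
    return None


def is_self_consistent(response: str) -> bool:
    """True if the stated number equals the count of alphabetic characters."""
    letters = sum(ch.isalpha() for ch in response)
    return claimed_count(response) == letters


print(is_self_consistent("Four."))  # True
print(is_self_consistent("Five."))  # False
```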

I Don't Know Test
Epistemic Humility (medium)

What was the color of my grandfather's favorite shirt?

Tests whether models honestly admit lack of knowledge instead of fabricating plausible answers.

Reverse Word Test
Character Manipulation (easy)

Write the word 'artificial' backwards.

Tests character-level manipulation. Simple task that many models fail.
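For reference, the expected string can be produced with a one-line slice. A minimal Python sketch:

```python
word = "artificial"
reversed_word = word[::-1]  # a step of -1 walks the string from the end
print(reversed_word)  # laicifitra
```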

10-Step Instructions
Instruction Following (hard)

Please do the following in your response: 1. Start with the word "Hello" 2. Include exactly 3 numbers 3. Use the word "elephant" twice 4. End with a question mark 5. Make it exactly 4 sentences 6. Include one emoji 7. Mention a color 8. Use alliteration in the first sentence 9. Include a movie reference 10. Make the last word "goodbye"

Tests ability to follow multiple detailed instructions simultaneously.
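Several of these instructions can be checked mechanically. The sketch below covers six of the ten and uses rough heuristics (for example, it splits sentences on runs of `.`, `!`, and `?`, so a decimal such as "3.5" would confuse it). The function name `check_instructions` and the sample response are hypothetical, not the site's validator.

```python
import re


def check_instructions(text: str) -> dict:
    """Check the mechanically verifiable subset of the ten instructions."""
    stripped = text.strip()
    words = re.findall(r"[A-Za-z']+", stripped)
    # Rough heuristic: sentences are separated by runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", stripped) if s.strip()]
    return {
        "starts_with_hello": stripped.startswith("Hello"),
        "exactly_3_numbers": len(re.findall(r"\d+", stripped)) == 3,
        "elephant_twice": len(re.findall(r"\belephant\b", stripped, re.IGNORECASE)) == 2,
        "ends_with_question_mark": stripped.endswith("?"),
        "exactly_4_sentences": len(sentences) == 4,
        "last_word_goodbye": bool(words) and words[-1].lower() == "goodbye",
    }


sample = (
    "Hello, happy humans! My blue elephant watched 3 movies, including Jaws, "
    "in 2 days 🐘. The other elephant slept for 7 hours. Shall we say goodbye?"
)
results = check_instructions(sample)
print(all(results.values()))  # True
```

Alliteration, the movie reference, the color, and the emoji are harder to verify with simple string checks and are left out of this sketch.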

Minute Moment Riddle
Pattern Matching (easy)

What can be seen once in a minute, twice in a moment, but never in a thousand years?

Classic riddle testing lateral thinking and pattern recognition.
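The usual answer to this riddle is the letter "m", which can be confirmed by counting it in each phrase. A short Python sketch:

```python
phrases = ("minute", "moment", "a thousand years")
counts = {phrase: phrase.count("m") for phrase in phrases}
print(counts)  # {'minute': 1, 'moment': 2, 'a thousand years': 0}
```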

Multi-Step Arithmetic
Multi Step Reasoning (easy)

I had 5 apples. I ate 2. I bought 3 more. I gave half to a friend. How many apples do I have now?

Tests multi-step reasoning and arithmetic. The steps are simple, but models sometimes lose track of the running total.
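Tracking the running total step by step gives the expected answer. A minimal Python sketch:

```python
apples = 5
apples -= 2   # ate 2, leaving 3
apples += 3   # bought 3 more, giving 6
apples //= 2  # gave half to a friend, keeping 3
print(apples)  # 3
```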

How It Works

1

Curate Viral Failures

We collect well-known AI failure cases — the ones that go viral on social media and highlight real limitations.

2

Automate Daily Testing

Every day at 3 AM UTC, each model receives identical prompts under controlled conditions. No cherry-picking.

3

Validate Responses

Answers are validated against expected results using exact match, pattern matching, and custom validators.

4

Track Over Time

We store every result to monitor whether models improve or regress on specific failure modes.
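The validation step could be sketched roughly as follows, with one example each of exact match, pattern matching, and a custom validator. This is an illustration under assumptions, not the site's actual implementation: the names `exact_match`, `pattern_match`, `admits_ignorance`, `VALIDATORS`, and `pass_rate`, and the specific patterns and marker phrases, are all hypothetical.

```python
import re
from typing import Callable, Dict

Validator = Callable[[str], bool]


def exact_match(expected: str) -> Validator:
    """Pass only if the trimmed response equals the expected string."""
    return lambda response: response.strip().lower() == expected.lower()


def pattern_match(pattern: str) -> Validator:
    """Pass if the regular expression is found anywhere in the response."""
    regex = re.compile(pattern, re.IGNORECASE)
    return lambda response: regex.search(response) is not None


def admits_ignorance(response: str) -> bool:
    """Custom validator: pass if the response signals the answer is unknowable."""
    markers = ("don't know", "do not know", "can't know", "cannot know")
    lowered = response.lower()
    return any(marker in lowered for marker in markers)


VALIDATORS: Dict[str, Validator] = {
    "Reverse Word Test": exact_match("laicifitra"),
    "Strawberry Problem": pattern_match(r"\b(3|three)\b"),
    "I Don't Know Test": admits_ignorance,
}


def pass_rate(responses: Dict[str, str]) -> float:
    """Fraction of tests whose response passes its validator."""
    passed = sum(VALIDATORS[name](text) for name, text in responses.items())
    return passed / len(responses)


responses = {
    "Reverse Word Test": "laicifitra",
    "Strawberry Problem": "There are 2 r's.",
    "I Don't Know Test": "I don't know; you haven't told me.",
}
print(f"{pass_rate(responses):.0%}")  # 67%
```

Storing each day's per-test booleans alongside the date would then be enough to derive trend labels such as Improving, Stable, or Declining.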
