ReAIty Check — Know your AI tools' limitations

Models are sycophantic — they assume every question has a valid answer and invent plausible-sounding explanations for each, even when the premises are mutually exclusive.

Kill Rate: 89%

Self-Reference Count

Self Reference

Tests self-awareness and recursive reasoning. Model must count letters in its own response.

Kill Rate: 81%
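A grader for this kind of challenge could be sketched as below. The page gives neither the prompt wording nor the scoring rule, so the sketch assumes the model is asked to state how many letters its own reply contains and that the claim is the first integer in the reply. The function names are illustrative only.

```python
import re


def count_letters(text: str) -> int:
    """Count alphabetic characters, ignoring digits, spaces, and punctuation."""
    return sum(1 for ch in text if ch.isalpha())


def claim_is_consistent(response: str) -> bool:
    """Return True if the first integer in the response equals its letter count.

    Assumption: the claimed count is written in digits and appears first.
    """
    match = re.search(r"\d+", response)
    if match is None:
        return False  # no numeric claim to check
    return int(match.group()) == count_letters(response)


print(claim_is_consistent("This reply has 19 letters."))  # True
print(claim_is_consistent("This reply has 20 letters."))  # False
```

The task is self-referential because changing the stated number can change the letter count it describes, so a correct reply has to be a fixed point rather than a guess.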

Car Wash Dilemma

Logic Reasoning

It's a car wash — you need to bring your car to wash it. The short distance is a red herring; models fixate on the 100m and recommend walking, forgetting the entire purpose of the trip.

Kill Rate: 80%

The Missing A

Pattern Matching

No number from 1 to 999 contains the letter 'a' when spelled out in English without the connective 'and' (for example, 'one hundred one'). The first number with an 'a' is 'one thousand'. Models confidently hallucinate answers like 'eight' or 'one hundred and'.

Kill Rate: 80%
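The Missing A claim can be checked by brute force. The sketch below is not the benchmark's own checker. It assumes the spelling convention without 'and' ('one hundred one'), and `spell` and `first_with_letter` are illustrative helpers.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell(n: int) -> str:
    """Spell an integer from 1 to 9999 in English, without 'and'."""
    if not 1 <= n <= 9999:
        raise ValueError("n must be between 1 and 9999")
    parts = []
    thousands, rest = divmod(n, 1000)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    hundreds, rest = divmod(rest, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        parts.append(TENS[tens] + ("-" + ONES[ones] if ones else ""))
    elif rest:
        parts.append(ONES[rest])
    return " ".join(parts)


def first_with_letter(letter: str, limit: int = 9999):
    """Return the smallest n in 1..limit whose spelling contains the letter."""
    for n in range(1, limit + 1):
        if letter in spell(n):
            return n
    return None


print(first_with_letter("a"))  # 1000
```

Under the convention that inserts 'and' ('one hundred and one'), the same search would stop at 101, which is why the convention has to be fixed before an answer can be graded.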


Benchmark

Providers Performance

Failure-rate snapshot by provider (averaged across their models).

Provider
Nvidia	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	0%
Mistralai	100%	100%	100%	100%	100%	100%	100%	100%	100%	0%	100%	0%	0%	0%	0%
Xiaomi	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	0%	0%	0%	0%	0%
Deepseek	100%	100%	100%	100%	100%	100%	100%	100%	100%	0%	0%	0%	0%	0%	0%
Baidu	100%	100%	100%	100%	100%	100%	0%	0%	0%	100%	100%	0%	100%	0%	0%
Google	89%	89%	100%	78%	78%	78%	11%	44%	78%	56%	44%	44%	22%	11%	11%
Prime-intellect	100%	100%	100%	100%	0%	0%	100%	100%	0%	100%	100%	0%	0%	0%	0%
Anthropic	92%	92%	85%	92%	85%	69%	85%	54%	15%	15%	8%	8%	15%	8%	15%
Arcee-ai	100%	100%	0%	100%	100%	0%	100%	100%	0%	100%	0%	0%	0%	0%	0%
Moonshotai	100%	100%	100%	100%	0%	0%	0%	100%	0%	0%	100%	0%	0%	0%	0%
Bytedance-seed	100%	100%	100%	100%	100%	0%	0%	0%	100%	0%	0%	0%	0%	0%	0%
X-ai	100%	100%	50%	0%	50%	0%	50%	50%	100%	0%	50%	0%	0%	0%	0%
Minimax	100%	50%	50%	100%	100%	50%	0%	50%	0%	0%	50%	0%	0%	0%	0%
Openai	100%	89%	56%	78%	78%	33%	56%	22%	33%	0%	0%	0%	0%	0%	0%
Z-ai	100%	100%	100%	0%	100%	0%	0%	0%	100%	0%	0%	0%	0%	0%	0%
Qwen	100%	0%	100%	0%	100%	0%	0%	0%	100%	0%	0%	0%	0%	0%	0%
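A per-provider figure "averaged across their models" could be computed as sketched below. The page does not describe its aggregation, so this assumes each model either passes or fails a given challenge and that the provider's figure is the share of its models that failed, rounded to a whole percent. The provider names and results in the example are hypothetical, not benchmark data.

```python
from collections import defaultdict


def provider_failure_rates(results):
    """Aggregate per-model outcomes for one challenge into provider percentages.

    results: iterable of (provider, model, failed) tuples, where failed is a bool.
    Returns a dict mapping provider to the percentage of its models that failed.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for provider, _model, failed in results:
        totals[provider] += 1
        failures[provider] += int(failed)
    return {p: round(100 * failures[p] / totals[p]) for p in totals}


# Hypothetical outcomes for a single challenge.
sample = [
    ("provider-a", "model-1", True),
    ("provider-a", "model-2", False),
    ("provider-b", "model-1", True),
]
print(provider_failure_rates(sample))  # {'provider-a': 50, 'provider-b': 100}
```

Under this scheme, a provider with a single tested model can only score 0% or 100%, so rows made up entirely of those two values are what the scheme would produce for a small sample.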


Have a tricky prompt?

Submit your edge case. If it breaks major models, we add it to the gauntlet and credit the submission.



Where AI Models Face Reality

Top Survivors

Qwen: Qwen3.5 397B A17B

Anthropic: Claude Opus 4.5

Anthropic: Claude Opus 4.6

Google: Gemini 3 Pro Preview

Anthropic: Claude 3.7 Sonnet (thinking)


Deadly Challenges

10-Step Instructions

Contradictory Premises