Qwen: Qwen3.5 397B A17B
by qwen
We throw tricky but funny prompts at top AI models and watch them squirm. Count letters. Flip cups. Cite imaginary dolphins.
Nobody passes clean.
Models ranked by how often they survive our challenge sets.
by qwen
by anthropic
by anthropic
by google
by anthropic
Prompt suites engineered to expose common model failure modes.
Tests ability to follow multiple detailed instructions simultaneously.
Models are sycophantic — they assume every question has a valid answer and invent plausible-sounding explanations for each, even when the premises are mutually exclusive.
Tests self-awareness and recursive reasoning. Model must count letters in its own response.
It's a car wash — you need to bring your car to wash it. The short distance is a red herring; models fixate on the 100m and recommend walking, forgetting the entire purpose of the trip.
No number from 1 to 999 contains the letter 'a' when spelled out in English without 'and' (as in 'one hundred one'). The first number with an 'a' is 'one thousand'. Models confidently hallucinate answers like 'eight' or 'one hundred and'.
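The claim about the letter 'a' is easy to check by brute force. Here is a minimal sketch, assuming the convention that omits 'and' between hundreds and the remainder; the helper names `spell` and `first_with_letter` are ours for illustration, not part of the test suite.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def spell(n: int) -> str:
    """Spell out 0..9999 in English, without 'and' (assumed convention)."""
    if not 0 <= n <= 9999:
        raise ValueError("spell() only covers 0..9999 in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + spell(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + (" " + spell(rest) if rest else "")


def first_with_letter(letter: str, limit: int = 9999):
    """Return the smallest n in 1..limit whose spelling contains `letter`."""
    for n in range(1, limit + 1):
        if letter in spell(n):
            return n
    return None


if __name__ == "__main__":
    n = first_with_letter("a")
    print(n, spell(n))
```

If the spelling convention inserts 'and' ('one hundred and one'), the answer changes, so the expected result depends on which convention the prompt intends.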
Failure-rate snapshot by provider (averaged across their models).
| Provider | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nvidia | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% |
| Mistralai | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 0% | 0% | 0% | 0% |
| Xiaomi | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 0% | 0% | 0% | 0% |
| Deepseek | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 0% | 0% | 0% | 0% | 0% |
| Baidu | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 0% | 0% | 100% | 100% | 0% | 100% | 0% | 0% |
| | 89% | 89% | 100% | 78% | 78% | 78% | 11% | 44% | 78% | 56% | 44% | 44% | 22% | 11% | 11% |
| Prime-intellect | 100% | 100% | 100% | 100% | 0% | 0% | 100% | 100% | 0% | 100% | 100% | 0% | 0% | 0% | 0% |
| Anthropic | 92% | 92% | 85% | 92% | 85% | 69% | 85% | 54% | 15% | 15% | 8% | 8% | 15% | 8% | 15% |
| Arcee-ai | 100% | 100% | 0% | 100% | 100% | 0% | 100% | 100% | 0% | 100% | 0% | 0% | 0% | 0% | 0% |
| Moonshotai | 100% | 100% | 100% | 100% | 0% | 0% | 0% | 100% | 0% | 0% | 100% | 0% | 0% | 0% | 0% |
| Bytedance-seed | 100% | 100% | 100% | 100% | 100% | 0% | 0% | 0% | 100% | 0% | 0% | 0% | 0% | 0% | 0% |
| X-ai | 100% | 100% | 50% | 0% | 50% | 0% | 50% | 50% | 100% | 0% | 50% | 0% | 0% | 0% | 0% |
| Minimax | 100% | 50% | 50% | 100% | 100% | 50% | 0% | 50% | 0% | 0% | 50% | 0% | 0% | 0% | 0% |
| Openai | 100% | 89% | 56% | 78% | 78% | 33% | 56% | 22% | 33% | 0% | 0% | 0% | 0% | 0% | 0% |
| Z-ai | 100% | 100% | 100% | 0% | 100% | 0% | 0% | 0% | 100% | 0% | 0% | 0% | 0% | 0% | 0% |
| Qwen | 100% | 0% | 100% | 0% | 100% | 0% | 0% | 0% | 100% | 0% | 0% | 0% | 0% | 0% | 0% |
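For readers wondering how a per-provider percentage like the ones above could be produced, here is a rough sketch. It assumes each cell is the share of a provider's models that failed a given test, rounded to a whole percent; the record layout `(provider, model, test, failed)` and the function name `provider_failure_rates` are hypothetical, not the site's actual pipeline.

```python
from collections import defaultdict


def provider_failure_rates(results):
    """Aggregate per-model pass/fail records into per-provider failure rates.

    `results` is an iterable of (provider, model, test, failed) tuples,
    a hypothetical record layout. Returns {provider: {test: percent}}.
    """
    # (provider, test) -> [number of failures, number of models tested]
    counts = defaultdict(lambda: [0, 0])
    for provider, _model, test, failed in results:
        entry = counts[(provider, test)]
        entry[0] += int(failed)
        entry[1] += 1

    table = defaultdict(dict)
    for (provider, test), (fails, total) in counts.items():
        table[provider][test] = round(100 * fails / total)
    return {provider: dict(tests) for provider, tests in table.items()}


if __name__ == "__main__":
    # Made-up data: 13 models from one provider, 12 of which fail test "t1".
    records = [("acme", f"model-{i}", "t1", i != 0) for i in range(13)]
    print(provider_failure_rates(records))
```

Under this reading, a cell such as 92% would be consistent with 12 of 13 models failing, though the page does not state how many models each provider has.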
Submit your edge case. If it breaks major models, we add it to the gauntlet and credit the submission.