A real-world simulation of a PC components company's
customer support environment.
A complete simulation - data, users and interactions.
Every entity an agent might need to query:
- Customer profiles with loyalty tiers (Standard → Platinum) and purchase history
- High-performance PC components with detailed technical specifications and compatibility data
- Order records with status tracking, line items, discounts, and fulfillment dates
- Customer issues with priority levels, categories, and resolution status
Success rate across 56 support scenarios
Percentage of tasks completed successfully
Success rates by task difficulty level
| Model | Simple | Medium | Hard | Complex |
|---|---|---|---|---|
| Claude Opus 4.5 (Anthropic) | 80.0% | 71.4% | 35.7% | 38.5% |
| GPT-5.2 (OpenAI) | 60.0% | 64.3% | 64.3% | 30.8% |
| Claude Sonnet 4.5 (Anthropic) | 73.3% | 57.1% | 57.1% | 7.7% |
| Kimi K2 (Moonshot) | 66.7% | 57.1% | 35.7% | 23.1% |
| DeepSeek V3.2 (DeepSeek) | 73.3% | 57.1% | 21.4% | 7.7% |
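
The per-difficulty rates above can be folded back into a single overall success rate by weighting each tier by its number of tasks. The sketch below illustrates that arithmetic. The tier sizes (15, 14, 14, 13) are an assumption: they are not stated in the source, but are inferred from the fact that the reported percentages look like multiples of 1/15, 1/14, 1/14 and 1/13 and that the sizes sum to the stated 56 scenarios. The names `TIER_SIZES`, `RATES` and `overall_success_rate` are made up for this example.

```python
# ASSUMPTION: tier sizes are inferred from the rounded percentages and the
# stated total of 56 scenarios; the source does not list them.
TIER_SIZES = {"Simple": 15, "Medium": 14, "Hard": 14, "Complex": 13}

# Per-tier success rates in percent, copied from the table above.
RATES = {
    "Claude Opus 4.5": {"Simple": 80.0, "Medium": 71.4, "Hard": 35.7, "Complex": 38.5},
    "GPT-5.2": {"Simple": 60.0, "Medium": 64.3, "Hard": 64.3, "Complex": 30.8},
    "Claude Sonnet 4.5": {"Simple": 73.3, "Medium": 57.1, "Hard": 57.1, "Complex": 7.7},
    "Kimi K2": {"Simple": 66.7, "Medium": 57.1, "Hard": 35.7, "Complex": 23.1},
    "DeepSeek V3.2": {"Simple": 73.3, "Medium": 57.1, "Hard": 21.4, "Complex": 7.7},
}


def overall_success_rate(rates, tier_sizes=TIER_SIZES):
    """Return the task-weighted overall success rate in percent.

    Each tier's percentage is converted back to a whole number of solved
    tasks (undoing the rounding in the table), then the counts are summed
    and divided by the total number of tasks.
    """
    solved = sum(round(rates[tier] / 100 * n) for tier, n in tier_sizes.items())
    return 100 * solved / sum(tier_sizes.values())


if __name__ == "__main__":
    for model, rates in RATES.items():
        print(f"{model}: {overall_success_rate(rates):.1f}%")
```

Weighting by task count matters because a plain average of the four columns would treat a 13-task tier the same as a 15-task tier. If the real tier sizes differ from the assumed ones, the overall figures change accordingly.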
For a given failure mode, what share of each model’s failures does it explain?
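
One way to read this question: for each model, count its failed tasks by failure mode and divide by that model's total number of failures. The sketch below shows that normalization. The function name, the model names and the failure-mode labels are all hypothetical placeholders, not data or categories from the benchmark.

```python
from collections import Counter


def failure_mode_shares(failures):
    """Map each model to {failure_mode: share of that model's failures}.

    `failures` maps a model name to a list with one failure-mode label per
    failed task. Shares are normalized within each model, so they sum to 1
    for any model that has at least one failure.
    """
    shares = {}
    for model, modes in failures.items():
        counts = Counter(modes)
        total = sum(counts.values())
        shares[model] = (
            {mode: n / total for mode, n in counts.items()} if total else {}
        )
    return shares


# HYPOTHETICAL labels, one per failed task; not results from the benchmark.
example = {
    "model_a": ["wrong_tool", "wrong_tool", "policy_violation", "gave_up"],
    "model_b": ["policy_violation"],
}
shares = failure_mode_shares(example)
```

Because the shares are normalized per model, they describe the mix of a model's failures, not their volume: a model with a single failure shows a 100% share for that one mode.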