env::bananazon_parts_v1

Bananazon

A real-world simulation of a PC components company's customer support environment.

Inside the Environment

A complete simulation - data, users and interactions.

A Complete Company Simulation

Every entity an agent might need to query

Customers

100+ records

Customer profiles with loyalty tiers (Standard → Platinum) and purchase history

Sarah Kim (Gold), Penny Whitcomb (Platinum)

Products

30+ records

High-performance PC components with detailed technical specifications and compatibility data

RTX 4080 Super, Storm 6600X CPU

Orders

300+ records

Order records with status tracking, line items, discounts, and fulfillment dates

Pending, Shipped

Support Tickets

80+ records

Customer issues with priority levels, categories, and resolution status

Compatibility, Shipping delays
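
The four entity types above could be modelled roughly as follows. This is only an illustrative sketch: every class, field name, and ID format here is hypothetical and not taken from the environment's actual schema, and the loyalty enum lists only the tiers named on this page.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional


class LoyaltyTier(Enum):
    # Only the tiers named on this page; intermediate tiers may exist.
    STANDARD = "Standard"
    GOLD = "Gold"
    PLATINUM = "Platinum"


@dataclass
class Customer:
    customer_id: str
    name: str
    tier: LoyaltyTier
    order_ids: list[str] = field(default_factory=list)  # purchase history


@dataclass
class Product:
    product_id: str
    name: str
    specs: dict[str, str] = field(default_factory=dict)       # technical specifications
    compatible_with: list[str] = field(default_factory=list)  # compatible product IDs


@dataclass
class LineItem:
    product_id: str
    quantity: int
    unit_price: float


@dataclass
class Order:
    order_id: str
    customer_id: str
    status: str                          # e.g. "Pending" or "Shipped"
    line_items: list[LineItem]
    discount: float = 0.0                # assumption: an absolute amount off
    fulfilled_on: Optional[date] = None  # unset until the order is fulfilled

    def total(self) -> float:
        """Sum of line items minus the discount, never below zero."""
        subtotal = sum(i.quantity * i.unit_price for i in self.line_items)
        return max(subtotal - self.discount, 0.0)


@dataclass
class SupportTicket:
    ticket_id: str
    customer_id: str
    category: str   # e.g. "Compatibility" or "Shipping delays"
    priority: str
    resolved: bool = False
```

An agent-facing query layer would then look records up by ID, for example resolving `SupportTicket.customer_id` to a `Customer` and walking `order_ids` to check an order's `status`.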
Benchmarks

Model Performance in Bananazon

Success rate across 56 support scenarios

Success Rate

Percentage of tasks completed successfully

Model               Provider    Success rate
Claude Opus 4.5     Anthropic   57.1%
GPT-5.2             OpenAI      55.4%
Claude Sonnet 4.5   Anthropic   50.0%
Kimi K2             Moonshot    46.4%
DeepSeek V3.2       DeepSeek    41.1%

Performance by Difficulty

Success rates by task difficulty level

Model               Provider    Simple   Medium   Hard     Complex
Claude Opus 4.5     Anthropic   80.0%    71.4%    35.7%    38.5%
GPT-5.2             OpenAI      60.0%    64.3%    64.3%    30.8%
Claude Sonnet 4.5   Anthropic   73.3%    57.1%    57.1%    7.7%
Kimi K2             Moonshot    66.7%    57.1%    35.7%    23.1%
DeepSeek V3.2       DeepSeek    73.3%    57.1%    21.4%    7.7%

Failure Mode Distribution

For a given failure mode, what share of each model’s failures does it explain?

🔧 Tool Use
Model               Provider    Tool-use failures   Total failures   Share
DeepSeek V3.2       DeepSeek    9                   32               28.1%
GPT-5.2             OpenAI      7                   25               28.0%
Kimi K2             Moonshot    4                   30               13.3%
Claude Sonnet 4.5   Anthropic   1                   18               5.6%
Claude Opus 4.5     Anthropic   1                   24               4.2%
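
Each share is the number of failures attributed to the mode divided by that model's total failure count. A minimal sketch of the calculation, using the Tool Use figures listed above (the names `TOOL_USE_FAILURES` and `failure_share` are illustrative):

```python
# (tool-use failures, total failures) per model, as listed above.
TOOL_USE_FAILURES = {
    "DeepSeek V3.2": (9, 32),
    "GPT-5.2": (7, 25),
    "Kimi K2": (4, 30),
    "Claude Sonnet 4.5": (1, 18),
    "Claude Opus 4.5": (1, 24),
}


def failure_share(mode_failures: int, total_failures: int) -> float:
    """Percentage of a model's failures explained by one failure mode."""
    return round(100 * mode_failures / total_failures, 1)


for model, (count, total) in TOOL_USE_FAILURES.items():
    print(f"{model}: {count}/{total} = {failure_share(count, total)}%")
```

Because the denominator is each model's own failure count, the shares compare how failure-prone tool use is relative to a model's other failure modes, not how often tool-use failures occur across all 56 scenarios.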