env::bananazon_parts_v1

Bananazon

A real-world simulation of a PC components company's
customer support environment.

Inside the Environment

A complete simulation: data, users, and interactions.

A Complete Company Simulation

Every entity an agent might need to query

Customers

100+ records

Customer profiles with loyalty tiers (Standard → Platinum) and purchase history

Sarah Kim (Gold), Penny Whitcomb (Platinum)

Products

30+ records

High-performance PC components with detailed technical specifications and compatibility data

RTX 4080 Super, Storm 6600X CPU

Orders

300+ records

Order records with status tracking, line items, discounts, and fulfillment dates

Pending, Shipped

Support Tickets

80+ records

Customer issues with priority levels, categories, and resolution status

Compatibility, Shipping delays
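The entities listed above could be modeled roughly as follows. This is only a sketch of how an agent-side client might represent them: the class names, field names, and ID formats are all hypothetical, and the environment's actual schema is not shown on this page.

```python
from dataclasses import dataclass, field
from enum import Enum


class LoyaltyTier(Enum):
    # Only these tiers are named on the page; other intermediate tiers may exist.
    STANDARD = "Standard"
    GOLD = "Gold"
    PLATINUM = "Platinum"


@dataclass
class Customer:
    customer_id: str  # hypothetical field name and ID format
    name: str
    tier: LoyaltyTier


@dataclass
class Order:
    order_id: str
    customer_id: str
    status: str  # e.g. "Pending" or "Shipped"
    line_items: list[str] = field(default_factory=list)


@dataclass
class SupportTicket:
    ticket_id: str
    customer_id: str
    category: str  # e.g. "Compatibility" or "Shipping delays"
    priority: str
    resolved: bool = False


def open_tickets_for(customer: Customer, tickets: list[SupportTicket]) -> list[SupportTicket]:
    """Return the unresolved tickets that belong to the given customer."""
    return [
        t for t in tickets
        if t.customer_id == customer.customer_id and not t.resolved
    ]
```

A support agent would typically chain lookups like this one: find the customer, pull their open tickets, then check the related orders before replying.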

Benchmarks

Model Performance in Bananazon

Success rate across 56 support scenarios

Success Rate

Percentage of tasks completed successfully

Claude Opus 4.5 · Anthropic

57.1%

GPT-5.2 · OpenAI

55.4%

Claude Sonnet 4.5 · Anthropic

50.0%

Kimi K2 · Moonshot

46.4%

DeepSeek V3.2 · DeepSeek

41.1%

Performance by Difficulty

Success rates by task difficulty level

Model	Simple	Medium	Hard	Complex
Claude Opus 4.5 · Anthropic	80.0%	71.4%	35.7%	38.5%
GPT-5.2 · OpenAI	60.0%	64.3%	64.3%	30.8%
Claude Sonnet 4.5 · Anthropic	73.3%	57.1%	57.1%	7.7%
Kimi K2 · Moonshot	66.7%	57.1%	35.7%	23.1%
DeepSeek V3.2 · DeepSeek	73.3%	57.1%	21.4%	7.7%
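The per-difficulty rates can be cross-checked against the overall success rates with a short sketch. The page does not state how many of the 56 scenarios fall into each tier; the split of 15/14/14/13 used here is an assumption, chosen because it sums to 56 and reproduces the published percentages.

```python
# Per-difficulty success rates (%) copied from the table:
# (Simple, Medium, Hard, Complex)
RATES = {
    "Claude Opus 4.5": (80.0, 71.4, 35.7, 38.5),
    "GPT-5.2": (60.0, 64.3, 64.3, 30.8),
    "Claude Sonnet 4.5": (73.3, 57.1, 57.1, 7.7),
    "Kimi K2": (66.7, 57.1, 35.7, 23.1),
    "DeepSeek V3.2": (73.3, 57.1, 21.4, 7.7),
}

# ASSUMPTION: tasks per tier. Not stated on the page; inferred so that
# the tiers sum to 56 and match the published percentages.
TIER_SIZES = (15, 14, 14, 13)


def overall_rate(rates, sizes=TIER_SIZES):
    """Recover passed-task counts per tier, then compute the overall rate in %."""
    passed = sum(round(r / 100 * n) for r, n in zip(rates, sizes))
    return round(100 * passed / sum(sizes), 1)


for model, rates in RATES.items():
    print(f"{model}: {overall_rate(rates)}%")
```

Under that assumed split, the recomputed overall rates agree with the headline figures for all five models.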

Failure Mode Distribution

For a given failure mode, what share of each model’s failures does it explain?

🔧 Tool Use

DeepSeek V3.2 · DeepSeek

9 (28.1%) / 32

GPT-5.2 · OpenAI

7 (28.0%) / 25

Kimi K2 · Moonshot

4 (13.3%) / 30

Claude Sonnet 4.5 · Anthropic

1 (5.6%) / 18

Claude Opus 4.5 · Anthropic

1 (4.2%) / 24
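Each entry above reads as "failures in this mode (share of the model's failures) / total failures for that model". A minimal sketch of that arithmetic, using the Tool Use counts listed; the function name is illustrative only.

```python
# (tool-use failures, total failures) per model, copied from the list above.
TOOL_USE = {
    "DeepSeek V3.2": (9, 32),
    "GPT-5.2": (7, 25),
    "Kimi K2": (4, 30),
    "Claude Sonnet 4.5": (1, 18),
    "Claude Opus 4.5": (1, 24),
}


def failure_share(mode_failures, total_failures):
    """Share (%) of a model's failures attributed to one failure mode."""
    if total_failures == 0:
        return 0.0  # a model with no failures has no share to report
    return round(100 * mode_failures / total_failures, 1)


for model, (mode, total) in TOOL_USE.items():
    print(f"{model}: {mode} ({failure_share(mode, total)}%) / {total}")
```

Because the denominator is each model's own failure count, the shares are comparable across models even when their overall success rates differ.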