🦞 OpenClaw Arena

Dual-Leaderboard Benchmark for Personal AI Agents

Evaluating 13 frontier models across 100 real-world tasks, measuring task competence and personality consistency for personal AI agents.

A research benchmark by CUHK VNLab · Not a game, but a scientific evaluation framework

🏆 Model Leaderboard

Rank Model Avg Score Score Tasks

👤 Config Leaderboard (gpt-4.1)

Same model, different SOUL.md: how much does config matter?

Rank Config Avg Score Score Tasks

📊 Config Impact: Top config scores 14.4% higher than the weakest (0.870 vs 0.761)

Even the baseline config (0.801) underperforms 8 out of 10 custom configs
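
The headline figures above can be checked with a short calculation. This is a minimal sketch, assuming the percentage is the relative gain of the top score over the weakest; the function name `relative_gain` is illustrative and not part of the benchmark. With the rounded scores shown here the result is about 14.3%, so the published 14.4% was presumably computed from unrounded scores.

```python
def relative_gain(top: float, bottom: float) -> float:
    """Relative improvement of `top` over `bottom`, as a percentage."""
    if bottom <= 0:
        raise ValueError("bottom score must be positive")
    return (top - bottom) / bottom * 100.0


# Rounded scores as reported on the page.
top_score = 0.870
weakest_score = 0.761
baseline_score = 0.801

print(f"top vs weakest:  {relative_gain(top_score, weakest_score):.1f}%")
print(f"top vs baseline: {relative_gain(top_score, baseline_score):.1f}%")
```

The same helper gives the gap between the top config and the baseline, which is a more direct answer to how much a custom SOUL.md adds over the default.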

🔥 Per-Task Heatmap

📊 Category Radar

💡 Key Findings

7 Evaluation Dimensions

Coming Soon: Agent MBTI Test 🧠

Classifying AI agent personalities across 4 cognitive dimensions, going beyond task scores.

Proactive ↔ Reactive

Does the agent anticipate or wait?

Structured ↔ Flexible

Rigid plans or adaptive flow?

Verbose ↔ Concise

Over-explains or gets to the point?

Empathetic ↔ Analytical

Reads emotions or sticks to facts?
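
One way to picture a four-axis classification like this is as a short code with one letter per axis, chosen by which pole a score leans toward. The sketch below is purely illustrative: the score range of -1 to 1, the letter codes, and the `classify` function are all assumptions, since the test itself has not been published.

```python
# Pole names for each axis, taken from the list above.
# Assumption: a negative score leans to the first pole, a non-negative
# score to the second. Sending exactly 0 to the second pole is arbitrary.
AXES = [
    ("Proactive", "Reactive"),
    ("Structured", "Flexible"),
    ("Verbose", "Concise"),
    ("Empathetic", "Analytical"),
]


def classify(scores):
    """Map four hypothetical scores in [-1, 1] to (type_code, pole_labels)."""
    if len(scores) != len(AXES):
        raise ValueError(f"expected {len(AXES)} scores, got {len(scores)}")
    labels = []
    for (first, second), score in zip(AXES, scores):
        if not -1.0 <= score <= 1.0:
            raise ValueError(f"score {score} is outside [-1, 1]")
        labels.append(first if score < 0 else second)
    # The type code is the initial of each chosen pole, e.g. "PFCE".
    code = "".join(label[0] for label in labels)
    return code, labels


# Hypothetical agent: anticipates, adapts, stays brief, reads emotions.
code, labels = classify([-0.4, 0.7, 0.2, -0.9])
print(code, labels)
```

A sign-based split keeps the sketch simple. A real instrument would more likely report the underlying scores alongside the code, since an agent near zero on an axis does not clearly belong to either pole.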