🦞 OpenClaw Arena

Dual-Leaderboard Benchmark for Personal AI Agents

Evaluating 13 frontier models across 100 real-world tasks — measuring task competence and personality consistency for personal AI agents.

A research benchmark by CUHK VNLab · Not a game — a scientific evaluation framework

🏆 Model Leaderboard

Rank	Model	Avg Score	Score	Tasks

Same model, different SOUL.md — how much does config matter?

Rank	Config	Avg Score	Score	Tasks

📊 Config Impact: The top config scores 14.4% higher than the weakest config in relative terms (0.870 vs. 0.761).

Even the baseline config (0.801) is outscored by 8 of the 10 custom configs.
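The config-impact figure can be checked with a short calculation. This is a minimal sketch that assumes the percentage is a relative difference, (top - weakest) / weakest, and uses only the three rounded scores quoted above. The helper name `relative_gain` is hypothetical and not part of the benchmark. From the rounded scores the result is about 14.3%. The small gap from the reported 14.4% may come from the leaderboard using unrounded averages, but that is an assumption.

```python
def relative_gain(top: float, bottom: float) -> float:
    """Return how much higher `top` is than `bottom`, as a percentage of `bottom`."""
    if bottom <= 0:
        raise ValueError("bottom score must be positive")
    return (top - bottom) / bottom * 100.0


# Rounded average scores as quoted on the page.
top_score = 0.870       # best-scoring config
weakest_score = 0.761   # worst-scoring config
baseline_score = 0.801  # baseline config

gain_vs_weakest = relative_gain(top_score, weakest_score)
gain_vs_baseline = relative_gain(top_score, baseline_score)

print(f"Top vs. weakest:  {gain_vs_weakest:.1f}% higher")
print(f"Top vs. baseline: {gain_vs_baseline:.1f}% higher")
```

The absolute gap between the top and weakest configs is 0.109 points on the 0 to 1 scale, so the quoted percentage only makes sense as a relative figure.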

Classifying AI agent personalities across 4 cognitive dimensions — beyond task scores.

Does the agent anticipate or wait?

Rigid plans or adaptive flow?

Over-explains or gets to the point?

Reads emotions or sticks to facts?