Dual-Leaderboard Benchmark for Personal AI Agents
Evaluating 13 frontier models across 100 real-world tasks, measuring task competence and personality consistency for personal AI agents.
A research benchmark by CUHK VNLab · Not a game, but a scientific evaluation framework
| Rank | Model | Avg Score | Score | Tasks |
|---|---|---|---|---|
Same model, different SOUL.md: how much does config matter?
| Rank | Config | Avg Score | Score | Tasks |
|---|---|---|---|---|
Config Impact: Top config scores 14.4% higher than the weakest (0.870 vs 0.761)
The baseline config (0.801) scores below 8 of the 10 custom configs
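The config-impact figure above is a relative difference between two average scores. The sketch below shows one way such a figure could be computed; the function name `relative_gain` is hypothetical, and the only inputs are the three rounded scores quoted above. With those rounded inputs the top-vs-weakest gap comes out at roughly 14.3%, so the reported 14.4% was presumably derived from unrounded scores (an assumption, since the raw values are not shown here).

```python
def relative_gain(top: float, weakest: float) -> float:
    """Return how much higher `top` is than `weakest`, as a percentage of `weakest`."""
    if weakest <= 0:
        raise ValueError("weakest score must be positive")
    return (top - weakest) / weakest * 100.0


# Rounded average scores quoted in the text; the unrounded values are not available.
top_score = 0.870
weakest_score = 0.761
baseline_score = 0.801

print(f"top vs weakest:  {relative_gain(top_score, weakest_score):.1f}%")
print(f"top vs baseline: {relative_gain(top_score, baseline_score):.1f}%")
```

The gap is expressed relative to the weaker score, not as an absolute difference: the absolute gap between 0.870 and 0.761 is 0.109 score points.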
Classifying AI agent personalities across 4 cognitive dimensions, going beyond task scores.
Does the agent anticipate or wait?
Rigid plans or adaptive flow?
Over-explains or gets to the point?
Reads emotions or sticks to facts?