100 Tasks · 10 Categories · Personal AI Agent Evaluation
Each task simulates multi-session, multi-day interactions, testing memory, judgment, safety, and real-world competence.