Evaluate Your Config — OpenClaw Arena

Upload Config

SOUL.md (required) + optional AGENTS.md, TOOLS.md

SOUL.md *required

📄 Drop your SOUL.md here or click to browse

AGENTS.md (optional)

📄 Drop AGENTS.md or click to browse

TOOLS.md (optional)

📄 Drop TOOLS.md or click to browse

Evaluation Settings

Choose models and evaluation scope

Select Models (1-3 models)

GPT-4.1

OpenAI

GPT-4o

OpenAI

GPT-4o Mini

OpenAI

Claude Sonnet 4

Anthropic

Gemini 2.5 Pro

Google

DeepSeek V3

DeepSeek

Evaluation Scope

⚡ Quick

5 tasks

~5 min

📊 Standard

19 tasks

~30 min

🔬 Full

40 tasks

~2 hours

Tasks included

🎉

Submitted Successfully!

Your evaluation request has been submitted as a GitHub Issue.

Estimated completion time: ~30 minutes

Results will be published on the Leaderboard page.

❓ FAQ

How does the evaluation work?

Your config files (SOUL.md, plus AGENTS.md and TOOLS.md if provided) are used as the system prompt for the agent. We run between 5 and 40 realistic tasks against it, depending on the evaluation scope you choose, and use an LLM judge to score the responses across multiple dimensions such as memory, emotional intelligence, and safety.
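
As a rough illustration of how config files could be combined into one system prompt, here is a minimal Python sketch. The function name `build_system_prompt`, the section headers, and the file order are assumptions made for this example; the page does not describe how the Arena actually joins the files.

```python
from typing import Optional


def build_system_prompt(
    soul: str,
    agents: Optional[str] = None,
    tools: Optional[str] = None,
) -> str:
    """Join the config files into one prompt string (hypothetical layout)."""
    # SOUL.md is the only required file.
    if not soul.strip():
        raise ValueError("SOUL.md is required and must not be empty")

    # Assumed order: SOUL.md, then AGENTS.md, then TOOLS.md.
    sections = [("SOUL.md", soul), ("AGENTS.md", agents), ("TOOLS.md", tools)]

    # Skip optional files that are missing or blank.
    parts = [
        f"# {name}\n\n{text.strip()}"
        for name, text in sections
        if text and text.strip()
    ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_system_prompt("Be kind.", tools="search"))
```

Labeling each section with its file name is only one possible design; it keeps the origin of each instruction visible when reading the combined prompt.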

Is my config data safe?

Your config is submitted as a public GitHub Issue, so anyone can read it. If your config contains sensitive information, redact it before submitting. We use it only for evaluation purposes.
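
One way to redact a config before submitting is to mask strings that look like secrets. The Python sketch below is illustrative only: the `redact` helper and both patterns are assumptions for this example, they will not catch every kind of sensitive data, and the result should still be reviewed by hand.

```python
import re

# Hypothetical patterns; adjust them to the secrets your config may contain.
TOKEN_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{16,}")
ASSIGNMENT_PATTERN = re.compile(
    r"(?i)\b(api[_-]?key|token|password)\b(\s*[:=]\s*)(\S+)"
)


def redact(text: str) -> str:
    """Replace secret-looking values with a placeholder."""
    # Mask bare tokens that start with "sk-".
    text = TOKEN_PATTERN.sub("[REDACTED]", text)
    # Mask the value in "name = value" or "name: value" pairs, keeping the name.
    return ASSIGNMENT_PATTERN.sub(
        lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text
    )


if __name__ == "__main__":
    print(redact("api_key = abc123"))
```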

How long does evaluation take?

Quick (5 tasks): ~5 minutes. Standard (19 tasks): ~30 minutes. Full (40 tasks): ~2 hours. Times may vary with queue load.
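
The figures above imply a rough average time per task for each scope. The Python sketch below works through that arithmetic; the `SCOPES` table and `minutes_per_task` helper are names invented for this example, and the results are approximations that ignore queue load.

```python
# (task count, approximate total minutes) per scope, taken from the answer above.
SCOPES = {
    "quick": (5, 5),
    "standard": (19, 30),
    "full": (40, 120),
}


def minutes_per_task(scope: str) -> float:
    """Return the approximate average minutes per task for a scope."""
    tasks, total_minutes = SCOPES[scope]
    return total_minutes / tasks


if __name__ == "__main__":
    for name in SCOPES:
        print(f"{name}: ~{minutes_per_task(name):.1f} min per task")
```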

Which models can I test with?

Currently supported: GPT-4.1, GPT-4o, GPT-4o Mini (OpenAI), Claude Sonnet 4 (Anthropic), Gemini 2.5 Pro (Google), DeepSeek V3 (DeepSeek). We're adding more models regularly.

Can I submit multiple configs?

Yes! Each submission is independent. You can iterate on your config and see how changes affect scores.