Score breakdown
Popularity is tracked separately. Support, ads, sponsorships, and tips never affect these signals.
Why it matters
Useful for AI search-agent builders, evaluation teams, and applied-AI researchers who need an MIT-licensed, open-source benchmark for evaluating search agents. It targets the hardest realistic search workloads in the wild (vague, multi-turn, persona-driven, proactive) with 200 long-horizon tasks scored by verifiable outcomes, so they can measure real-world search-agent quality instead of relying on
Who should use it
Who should skip it
Skip if the source link, docs, or setup requirements do not match your workflow.
Risk explanation
This is an open-source search-agent evaluation benchmark with 200 long-horizon tasks scored by verifiable outcomes. Review the task suite for prompts that may be unsafe to run against production search agents, confirm that the verifiable-outcome scoring logic fits your evaluation goals, and audit the benchmark's task sources before treating the score as a public quality signal.