VibeBench/VibeSearchBench

Score: 7.4

Popularity: 7.0

Risk: low

Tier: Silver

Score breakdown

Usefulness: 7.0

Novelty: 9.0

Momentum: 7.0

Maturity: 5.6

Open-source/build: 8.4

Evidence: 7.2

Workflow potential: 8.2

Setup ease: 6.4

Popularity is tracked separately. Support, ads, sponsorships, and tips never affect these signals.

Why it matters

Useful for AI search-agent builders, evaluation teams, and applied-AI researchers who need an MIT-licensed, open-source search-agent evaluation benchmark. It targets the hardest realistic search workloads in the wild (vague, multi-turn, persona-driven, proactive) with 200 long-horizon tasks scored by verifiable outcomes, so they can measure real-world search-agent quality instead of relying on …

Who should use it

AI search-agent builders who need an MIT-licensed, open-source search-agent benchmark with vague, multi-turn, persona-driven, proactive tasks scored by verifiable outcomes

Evaluation teams who want a single search-agent benchmark that reflects real-world search workloads instead of toy keyword queries

Applied-AI researchers who need a 200-task long-horizon search-agent benchmark to compare new search-agent designs against the current state of the art

Open-source contributors who want an MIT-licensed alternative to closed-source, vendor-locked search-agent evaluation benchmarks

Who should skip it

Skip if the source link, docs, or setup requirements do not match your workflow.

Risk explanation

This is an open-source search-agent evaluation benchmark with 200 long-horizon tasks scored by verifiable outcomes. Review the task suite for prompts that may be unsafe to run against production search agents, confirm that the verifiable-outcome scoring logic fits your evaluation goals, and audit the benchmark's task sources before treating the score as a public quality signal.

Evidence links

github.com

Closest alternatives / related signals

benchmark, search-agent, agentic-ai, evaluation, long-horizon, open-source, mit