VibeBench/VibeSearchBench

VibeBench/VibeSearchBench is an MIT-licensed, open-source search-agent evaluation benchmark from VibeBench that targets the hardest realistic search workloads in the wild — vague, multi-turn, persona-driven, proactive — with 200 long-horizon tasks scored by verifiable outcomes, so AI search-agent builders and evaluation teams can measure real-world search-agent quality instead of relying on toy ke

Score: 7.4
Popularity: 7.0
Risk: low
Tier: Silver

Score breakdown

Usefulness: 7.0
Novelty: 9.0
Momentum: 7.0
Maturity: 5.6
Open-source/build: 8.4
Evidence: 7.2
Workflow potential: 8.2
Setup ease: 6.4

Popularity is tracked separately. Support, ads, sponsorships, and tips never affect these signals.

Why it matters

Useful for AI search-agent builders, evaluation teams, and applied-AI researchers who need an MIT-licensed, open-source search-agent evaluation benchmark that targets the hardest realistic search workloads in the wild — vague, multi-turn, persona-driven, proactive — with 200 long-horizon tasks scored by verifiable outcomes, so they can measure real-world search-agent quality instead of relying on

Who should use it

AI search-agent builders who need an MIT-licensed, open-source search-agent benchmark with vague, multi-turn, persona-driven, proactive tasks scored by verifiable outcomes
Evaluation teams who want a single search-agent benchmark that reflects real-world search workloads instead of toy keyword queries
Applied-AI researchers who need a 200-task long-horizon search-agent benchmark to compare new search-agent designs against the current state of the art
Open-source contributors who want an MIT-licensed alternative to closed-source, vendor-locked search-agent evaluation benchmarks

Who should skip it

Skip if the source link, docs, or setup requirements do not match your workflow.

Risk explanation

This is an open-source search-agent evaluation benchmark with 200 long-horizon tasks scored by verifiable outcomes. Before relying on it, review the task suite for prompts that may be unsafe to run against production search agents, confirm that the verifiable-outcome scoring logic fits your evaluation goals, and audit the benchmark's task sources before treating the score as a public quality signal.

Evidence links

Closest alternatives / related signals

benchmark, search-agent, agentic-ai, evaluation, long-horizon, open-source, mit