weavebench/WeaveBench: AI tool review & score

Score: 7.7
Popularity: 1.0
Risk: conditional
Tier: Silver

Score breakdown

Usefulness: 7.0
Novelty: 8.0
Momentum: 5.0
Maturity: 5.2
Open-source/build: 8.4
Evidence: 8.0
Workflow potential: 8.5
Setup ease: 4.2

Popularity is tracked separately. Support, ads, sponsorships, and tips never affect these signals.

Why it matters

Useful for agent teams that need harder, more realistic browser-plus-terminal evaluation than toy benchmarks, even if the full setup is too heavy for casual users.

Who should use it

Evaluation teams testing computer-use agents beyond toy tasks
Agent platform engineers comparing harnesses like Codex, Claude Code, OpenClaw, and Hermes
Researchers studying long-horizon browser-plus-terminal behavior
Builders who need trajectory-aware judging rather than file-existence scoring

Who should skip it

Skip weavebench/WeaveBench for now if your priority is a tool you can use today without configuring a build pipeline or development environment.

About this signal

weavebench/WeaveBench is tracked by RepoRadar as a benchmarking tool in the Developer Tools section. It was first seen on 2026-06-28 and last updated on 2026-06-28. The current verdict is 'worth watch', with a Silver tier and a hard setup difficulty. The standout signals for weavebench/WeaveBench are workflow potential (8.5) and open-source/build quality (8.4), while setup ease (4.2) trails; that balance shapes where it fits best. This page summarizes the evidence RepoRadar has captured from source metadata. The score, tier, risk label, and verdict on this page are never influenced by sponsorship, ads, or tips. They reflect only the usefulness, novelty, momentum, maturity, and evidence signals described in the RepoRadar methodology.

How this item is evaluated

RepoRadar assigned weavebench/WeaveBench a composite score of 7.7 out of 10, placing it in the Silver tier. This score combines weighted sub-signals: usefulness (35%), novelty (18%), momentum (14%), maturity (10%), open-source/build quality (7%), evidence quality (6%), workflow potential (6%), and setup ease (4%). Popularity is tracked separately at 1.0 and never affects the composite score or tier. The risk label of 'conditional' reflects inherent user-impacting hazards, not generic novelty. Items with no risk flag may still require normal code review before production use.
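The listed weights sum to 100%, so they can be checked against the published sub-scores. The sketch below assumes the composite is a plain weighted average, which is an assumption and not RepoRadar's documented formula. Under that assumption the result is about 6.86, not the published 7.7, so the actual composite presumably includes a normalization or adjustment step that this page does not describe.

```python
# Sub-scores and weights exactly as listed on this page.
SCORES = {
    "usefulness": 7.0,
    "novelty": 8.0,
    "momentum": 5.0,
    "maturity": 5.2,
    "open_source_build": 8.4,
    "evidence": 8.0,
    "workflow_potential": 8.5,
    "setup_ease": 4.2,
}
WEIGHTS = {
    "usefulness": 0.35,
    "novelty": 0.18,
    "momentum": 0.14,
    "maturity": 0.10,
    "open_source_build": 0.07,
    "evidence": 0.06,
    "workflow_potential": 0.06,
    "setup_ease": 0.04,
}


def weighted_average(scores, weights):
    """Plain weighted average (an assumed formula, not RepoRadar's own)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


plain = weighted_average(SCORES, WEIGHTS)
print(f"plain weighted average: {plain:.2f}")  # popularity is excluded, as stated
```

Usefulness dominates at 35%, so a one-point change there moves this average by 0.35, while a one-point change in setup ease moves it by only 0.04.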

Putting this into practice? Read "How to read AI benchmarks without getting fooled" for the checklist behind this score.

Risk explanation

Full runs need KVM, Docker, roughly 30 GB of assets, and an OpenRouter API key, so treat it as a serious evaluation lab setup rather than a lightweight benchmark that runs on any laptop. The offline demo shows the judging format, but the real value comes from multi-hour VM-backed runs with billable judge and agent calls, so budget for cost and infrastructure before adoption.
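The prerequisites above can be checked before committing to a full run. The following is a minimal preflight sketch, not part of WeaveBench itself: the `preflight` helper, the 30 GB threshold applied to the working directory, and the `OPENROUTER_API_KEY` variable name are all assumptions made for illustration, and the KVM check assumes a Linux host where KVM is exposed as `/dev/kvm`.

```python
import os
import shutil

REQUIRED_FREE_GB = 30  # "roughly 30 GB of assets" per the risk note


def preflight(workdir=".", env=None):
    """Return a dict mapping each prerequisite to True/False.

    Hypothetical helper: WeaveBench may check these differently or not at all.
    """
    env = os.environ if env is None else env
    free_gb = shutil.disk_usage(workdir).free / 1e9
    return {
        # On Linux, KVM is exposed as the /dev/kvm device node.
        "kvm": os.path.exists("/dev/kvm"),
        # Only checks that a docker binary is on PATH, not that the daemon runs.
        "docker": shutil.which("docker") is not None,
        "disk_30gb": free_gb >= REQUIRED_FREE_GB,
        # Assumed variable name; check the project's docs for the real one.
        "openrouter_key": bool(env.get("OPENROUTER_API_KEY")),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

A passing preflight only confirms the machine can start a run; it says nothing about the cost of the billable judge and agent calls, which still needs its own budget.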

Evidence links

github.com

Closest alternatives / related signals

agent-evals, computer-use, benchmarking, developer-tools, python, mit