github.com

agentscope-ai/PawBench

agentscope-ai/PawBench is a benchmark that RepoRadar is tracking in its Evaluation section, currently rated Gold tier with a 'try now' verdict. Its strongest signal is workflow potential, scored 9.5 out of 10.

Score: 8.4
Popularity: 1.0
Risk: none
Tier: Gold
Score breakdown
Usefulness: 8.0
Novelty: 9.0
Momentum: 8.0
Maturity: 6.6
Open-source/build: 8.4
Evidence: 7.2
Workflow potential: 9.5
Setup ease: 6.4

Popularity is tracked separately. Support, ads, sponsorships, and tips never affect these signals.

Why it matters

Useful for anyone comparing coding agents or agent harnesses who is tired of a single pass-rate number that hides whether the model or the runtime is the weak link.

Who should use it

Researchers and engineers comparing coding agents and agent harnesses head-to-head
Teams debugging whether a regression is in the model or in the agent runtime
Benchmark authors looking for a public example of co-evaluation design
Open-source maintainers measuring the impact of harness changes on a fixed model lineup

Who should skip it

Move on from agentscope-ai/PawBench if the licensing terms, language support, or platform requirements do not fit your project.

About this signal

agentscope-ai/PawBench is tracked by RepoRadar as a benchmark in the Evaluation section. It was first seen on 2026-06-30 and last updated on 2026-06-30. The current verdict is 'try now' with a Gold tier and moderate setup difficulty. agentscope-ai/PawBench leads on workflow potential (9.5) and novelty (9.0). Its lowest signal is setup ease (6.4), so factor that in before investing setup time. This page summarizes the evidence RepoRadar has captured from source metadata. The score, tier, risk label, and verdict on this page are never influenced by sponsorship, ads, or tips. They reflect only the usefulness, popularity, novelty, momentum, maturity, and evidence signals described in the RepoRadar methodology.

How this item is evaluated

RepoRadar assigned agentscope-ai/PawBench a composite score of 8.4 out of 10, placing it in the Gold tier. This score combines weighted sub-signals: usefulness (35%), novelty (18%), momentum (14%), maturity (10%), open-source/build quality (7%), evidence quality (6%), workflow potential (6%), and setup ease (4%). Popularity is tracked separately at 1.0 and never affects the composite score or tier. The risk label of 'none' reflects inherent user-impacting hazards, not generic novelty. Items with no risk flag may still require normal code review before production use.
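To make the weighting concrete, here is a minimal sketch that applies the stated percentages to the sub-scores listed in the score breakdown as a plain weighted average. The function name and dictionary keys are hypothetical, and the plain-average formula is an assumption: this page does not spell out the exact calculation. Under that assumption the result is about 8.05 rather than the published 8.4, so the full RepoRadar methodology presumably includes steps that are not described here.

```python
# Sub-scores and weights copied from this page. The plain weighted average
# below is an assumed formula, not RepoRadar's confirmed method.
SUB_SCORES = {
    "usefulness": 8.0,
    "novelty": 9.0,
    "momentum": 8.0,
    "maturity": 6.6,
    "open_source_build": 8.4,
    "evidence": 7.2,
    "workflow_potential": 9.5,
    "setup_ease": 6.4,
}

# Weights in whole percent, as stated in the text above.
WEIGHTS_PCT = {
    "usefulness": 35,
    "novelty": 18,
    "momentum": 14,
    "maturity": 10,
    "open_source_build": 7,
    "evidence": 6,
    "workflow_potential": 6,
    "setup_ease": 4,
}


def weighted_composite(scores, weights_pct):
    """Return the weighted average of scores, with weights given in percent."""
    if set(scores) != set(weights_pct):
        raise ValueError("scores and weights must cover the same signals")
    total_pct = sum(weights_pct.values())
    if total_pct != 100:
        raise ValueError(f"weights sum to {total_pct}, expected 100")
    return sum(scores[name] * weights_pct[name] for name in scores) / total_pct


composite = weighted_composite(SUB_SCORES, WEIGHTS_PCT)
print(f"{composite:.2f}")  # prints 8.05
```

Popularity (1.0) is deliberately left out of both dictionaries, matching the statement that it never affects the composite score or tier.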

Putting this into practice? Read "How to vet an AI agent or MCP server before you wire it in" for the checklist behind this score.

Risk explanation

No inherent user-impacting risk is flagged from the captured evidence.

Tags: benchmark, evaluation, agent, harness, research, model-comparison, apache-2.0