anthropics/scone-bench: AI tool review & score

Score: 8.2

Popularity: 1.0

Risk: conditional

Tier: Gold

Score breakdown

Usefulness: 8.0

Novelty: 9.0

Momentum: 7.0

Maturity: 6.1

Open-source/build: 8.4

Evidence: 7.2

Workflow potential: 9.3

Setup ease: 4.2

Popularity is tracked separately. Support, ads, sponsorships, and tips never affect these signals.

Why it matters

Useful for agent researchers and security-tool builders who want a verifiable, reproducible smart-contract exploitation benchmark without risking mainnet funds.

Who should use it

Agent researchers working on long-horizon, multi-step tool-use tasks with real stakes

Security-tool builders evaluating AI components for smart-contract audit and exploit workflows

Anthropic API users comparing Claude model families on adversarial reasoning

Benchmark authors looking for a well-engineered, anti-cheat reference implementation

Who should skip it

Hold off on anthropics/scone-bench if the setup requirements exceed what your current workflow or team can support without dedicated engineering time.

About this signal

anthropics/scone-bench is tracked by RepoRadar as a benchmark in the Evaluation section. It was first seen on 2026-06-30 and last updated on 2026-06-30. The current verdict is 'try now' with a Gold tier and hard setup difficulty. anthropics/scone-bench leads on workflow potential (9.3) and novelty (9.0); its lowest signal is setup ease (4.2), so factor that in before investing setup time. This page summarizes the evidence RepoRadar has captured from source metadata. The score, tier, risk label, and verdict on this page are never influenced by sponsorship, ads, or tips. They reflect only the usefulness, novelty, momentum, maturity, evidence, and related signals described in the RepoRadar methodology; popularity is tracked separately.

How this item is evaluated

RepoRadar assigned anthropics/scone-bench a composite score of 8.2 out of 10, placing it in the Gold tier. This score combines weighted sub-signals: usefulness (35%), novelty (18%), momentum (14%), maturity (10%), open-source/build quality (7%), evidence quality (6%), workflow potential (6%), and setup ease (4%). Popularity is tracked separately at 1.0 and never affects the composite score or tier. The risk label of 'conditional' reflects inherent user-impacting hazards, not generic novelty. Items with no risk flag may still require normal code review before production use.
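The weighting described above can be sketched as a plain weighted average. This is only an illustration: the page does not publish RepoRadar's exact formula, so the function name `weighted_composite` and the assumption that the sub-signals are simply multiplied by their weights and summed are hypothetical. The scores and weights are the ones listed on this page.

```python
# Sub-signal (score, weight) pairs as listed on this page.
# Assumption: the composite is a plain weighted average; RepoRadar's
# actual formula is not given here and may include other steps.
SIGNALS = {
    "usefulness": (8.0, 0.35),
    "novelty": (9.0, 0.18),
    "momentum": (7.0, 0.14),
    "maturity": (6.1, 0.10),
    "open_source_build": (8.4, 0.07),
    "evidence": (7.2, 0.06),
    "workflow_potential": (9.3, 0.06),
    "setup_ease": (4.2, 0.04),
}


def weighted_composite(signals):
    """Return the weighted sum of scores; weights must total 1.0."""
    total_weight = sum(weight for _, weight in signals.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(score * weight for score, weight in signals.values())


composite = weighted_composite(SIGNALS)
strongest = max(SIGNALS, key=lambda name: SIGNALS[name][0])
weakest = min(SIGNALS, key=lambda name: SIGNALS[name][0])

print(f"weighted average: {composite:.2f}")
print(f"strongest signal: {strongest}, weakest signal: {weakest}")
```

Under this assumption the weighted average comes to about 7.76, which is lower than the published composite of 8.2. The published score therefore likely involves rounding, normalization, or adjustments that this page does not describe, so treat the sketch as a way to sanity-check the relative weights rather than to reproduce the score.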

Putting this into practice? Read "How to read AI benchmarks without getting fooled" for the checklist behind this score.

Risk explanation

Anthropic's README explicitly says 'Not maintained and not accepting contributions,' so plan for upstream breakage against newer anvil, Foundry, and Rust toolchains. The benchmark uses real-world exploit patterns from DeFiHackLabs, so results reflect agent capability on adversarial tasks, not general coding ability.

Evidence links

github.com

Closest alternatives / related signals

benchmark, smart-contract, security, anthropic, agent, evaluation, defi, apache-2.0