darkrishabh/agent-skills-eval

RepoRadar surfaced darkrishabh/agent-skills-eval, an MIT-licensed test runner for Anthropic Agent Skills, at Gold tier with a 'try now' verdict. Its strongest signal is workflow potential, scored 9.9 out of 10.

Score: 8.4
Popularity: 602.0
Risk: none
Tier: Gold

Score breakdown

Usefulness: 9.0
Novelty: 9.0
Momentum: 8.0
Maturity: 9.1
Open-source/build: 8.4
Evidence: 8.0
Workflow potential: 9.9
Setup ease: 8.8

Popularity is tracked separately. Support, ads, sponsorships, and tips never affect these signals.

Why it matters

Useful for **Agent Skills authors who want receipts**: `agent-skills-eval` runs the same prompt twice (with_skill vs without_skill), has a judge model grade both, and produces a side-by-side HTML report so the skill author can prove the SKILL.md actually improves the model's performance rather than just adding noise. Also useful for **Claude Code / Codex / OpenClaw / Hermes Agent skill library maintainers**: the test runner is intentionally runtime-agnostic, so a single suite covers all four consumers.

Who should use it

**Agent Skills authors who want receipts**: `agent-skills-eval` runs the same prompt twice (with_skill vs without_skill), has a judge model grade both, and produces a side-by-side HTML report so the skill author can prove the SKILL.md actually improves the model's performance.

**Claude Code / Codex / OpenClaw / Hermes Agent skill library maintainers**: the test runner is intentionally runtime-agnostic, so a single suite covers all four consumers with the same workspace layout and the same judge grading.

**AI-tool teams shipping internal skills**: the `--baseline` flag is the audit trail that says 'here is the model's performance before the skill and here it is after, the receipts are in `report/index.html`'.

**Skill-library curators**: a skill library that ships with `agent-skills-eval` tests attached can answer 'which skills are net-positive, which are net-neutral, which are net-negative' across the whole library with one CLI run.

**Researchers studying skill efficacy**: the paired-prompt baseline-subtraction design removes model variance as a confound and surfaces the marginal contribution of the SKILL.md alone.

**CI gates**: the GitHub Actions workflow wires the test runner into PR checks so a new SKILL.md cannot land unless the with_skill version beats the without_skill baseline by a configurable margin.

**Judge-model experimentation**: a team can run the same skills with different judges (`gpt-4o-mini`, `claude-sonnet-4-6`, `o3-mini`) to see how judge selection affects the verdict.

Evaluation: `npm install -g agent-skills-eval`, then `npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline --strict`; the docs at darkrishabh.github.io/agent-skills-eval/ walk through the workspace layout.
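
The paired with_skill / without_skill comparison and the margin-based gate described above can be illustrated with a minimal Python sketch. This is not the tool's implementation: the `PairedResult` type, the function names, the score scale, and the sample scores are all hypothetical and exist only to show the idea of subtracting the baseline per prompt and gating on the mean difference.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PairedResult:
    """One prompt graded twice by the same judge (hypothetical structure)."""
    prompt: str
    with_skill: float     # judge score for the run that had the SKILL.md
    without_skill: float  # judge score for the baseline run


def marginal_lift(results):
    """Mean per-prompt difference: with_skill minus without_skill."""
    return mean(r.with_skill - r.without_skill for r in results)


def passes_gate(results, margin):
    """True when the skill beats the baseline by at least `margin` on average."""
    return marginal_lift(results) >= margin


# Made-up scores for illustration only; not output from agent-skills-eval.
results = [
    PairedResult("summarize changelog", with_skill=8.0, without_skill=6.0),
    PairedResult("write release notes", with_skill=7.0, without_skill=7.0),
    PairedResult("draft migration guide", with_skill=9.0, without_skill=5.0),
]

print(marginal_lift(results))             # 2.0
print(passes_gate(results, margin=1.0))   # True
```

Because each prompt is compared against its own baseline, a prompt the model already handles well contributes zero lift instead of inflating the skill's apparent value.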

Who should skip it

Skip darkrishabh/agent-skills-eval unless the captured evidence suggests it solves a problem you are actively working on.

About this signal

darkrishabh/agent-skills-eval is tracked by RepoRadar as an MIT-licensed test runner for Anthropic Agent Skills. It was first seen on 2026-06-25 and last updated on 2026-06-25. The current verdict is 'try now' with a Gold tier and easy setup difficulty. The standout signals for darkrishabh/agent-skills-eval are workflow potential (9.9) and maturity (9.1), while evidence quality (8.0) trails; that balance shapes where it fits best. This page summarizes the evidence RepoRadar has captured from source metadata. The score, tier, risk label, and verdict on this page are never influenced by sponsorship, ads, or tips; they reflect only the usefulness, popularity, novelty, momentum, maturity, and evidence signals described in the RepoRadar methodology.

How this item is evaluated

RepoRadar assigned darkrishabh/agent-skills-eval a composite score of 8.4 out of 10, placing it in the Gold tier. This score combines weighted sub-signals: usefulness (35%), novelty (18%), momentum (14%), maturity (10%), open-source/build quality (7%), evidence quality (6%), workflow potential (6%), and setup ease (4%). Popularity is tracked separately at 602.0 and never affects the composite score or tier. The risk label of 'none' reflects inherent user-impacting hazards, not generic novelty. Items with no risk flag may still require normal code review before production use.
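
One natural reading of the weights above is a plain weighted average of the sub-scores. The sketch below assumes that reading; the captured text does not state RepoRadar's actual formula, and the names `WEIGHTS_PERCENT`, `SUB_SCORES`, and `weighted_composite` are illustrative. Under this assumption the listed sub-scores come to about 8.81 rather than the published 8.4, so the published composite evidently involves something beyond a plain weighted average.

```python
# Weights as listed on this page, in percent (they sum to 100).
WEIGHTS_PERCENT = {
    "usefulness": 35,
    "novelty": 18,
    "momentum": 14,
    "maturity": 10,
    "open_source_build": 7,
    "evidence": 6,
    "workflow_potential": 6,
    "setup_ease": 4,
}

# Sub-scores as listed in the score breakdown on this page.
SUB_SCORES = {
    "usefulness": 9.0,
    "novelty": 9.0,
    "momentum": 8.0,
    "maturity": 9.1,
    "open_source_build": 8.4,
    "evidence": 8.0,
    "workflow_potential": 9.9,
    "setup_ease": 8.8,
}


def weighted_composite(scores, weights_percent):
    """Plain weighted average (an assumption, not RepoRadar's stated method)."""
    total_weight = sum(weights_percent.values())
    weighted_sum = sum(scores[name] * weight for name, weight in weights_percent.items())
    return weighted_sum / total_weight


composite = weighted_composite(SUB_SCORES, WEIGHTS_PERCENT)
print(round(composite, 2))  # 8.81
```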

Putting this into practice? Read "How to vet an AI agent or MCP server before you wire it in" for the checklist behind this score.

Risk explanation

No inherent user-impacting risk is flagged from the captured evidence.

Closest alternatives / related signals

agent-skills-eval, darkrishabh, anthropic-agent-skills, agent-skills, agentskills-io, skill-eval, skill-testing, skill-receipts