harbor-framework/harbor

Score: 8.4

Popularity: 78.0

Risk: conditional

Tier: Gold

Score breakdown

Usefulness: 8.0

Novelty: 8.0

Momentum: 9.0

Maturity: 8.2

Open-source/build: 8.4

Evidence: 7.2

Workflow potential: 9.5

Setup ease: 6.4

Popularity is tracked separately. Support, ads, sponsorships, and tips never affect these signals.

Why it matters

Useful for teams that already trust Claude Code, Codex, Aider, or homegrown agents and now need credible, reproducible evals. Install Harbor, point it at a small, representative task set, and confirm that its scoring matches human spot-checks before scaling up to RL or CI-gated evals.
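One way to make the spot-check step concrete is to compare automated pass/fail verdicts against a handful of human labels and list the tasks where they disagree. The sketch below is plain Python and does not use Harbor's API. The function name `spot_check_agreement`, the task ids, and the verdicts are all hypothetical and only illustrate the comparison.

```python
from typing import Mapping


def spot_check_agreement(
    auto: Mapping[str, bool], human: Mapping[str, bool]
) -> tuple[float, list[str]]:
    """Return (agreement rate, task ids where scorer and human disagree).

    Only tasks present in both mappings are compared.
    """
    shared = sorted(set(auto) & set(human))
    if not shared:
        raise ValueError("no overlapping task ids to compare")
    disagreements = [task for task in shared if auto[task] != human[task]]
    return 1 - len(disagreements) / len(shared), disagreements


# Hypothetical data: automated verdicts for four tasks, human labels for three.
auto_scores = {"task-1": True, "task-2": False, "task-3": True, "task-4": True}
human_labels = {"task-1": True, "task-2": False, "task-3": False}

rate, mismatched = spot_check_agreement(auto_scores, human_labels)
print(f"agreement={rate:.2f}, review={mismatched}")
```

A low agreement rate, or disagreements that cluster on one kind of task, is a reason to review the scoring logic before using it to gate CI or to drive RL rewards.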

Who should use it

agent evaluation teams, applied RL researchers, platform teams, agent product engineers

Who should skip it

Skip if the source link, docs, or setup requirements do not match your workflow.

Risk explanation

Docker-based sandboxes still need to be scoped and isolated per evaluation run; scoring logic should be reviewed against human spot-checks before being treated as ground truth.
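As an illustration of per-run scoping, the sketch below builds a `docker run` command line with a unique container name, no network, a read-only root filesystem, dropped capabilities, and resource limits. It is an assumption about how a team might harden its own sandbox, not a description of how Harbor launches containers. The helper name `sandbox_command`, the image, and the limit values are hypothetical.

```python
import uuid
from typing import Optional


def sandbox_command(
    image: str, task_cmd: list[str], run_id: Optional[str] = None
) -> list[str]:
    """Build a `docker run` argv for one isolated, disposable evaluation run."""
    run_id = run_id or uuid.uuid4().hex[:12]
    return [
        "docker", "run",
        "--rm",                                 # remove the container when the run ends
        "--name", f"eval-{run_id}",             # one container per evaluation run
        "--network", "none",                    # no network access from inside the sandbox
        "--read-only",                          # read-only root filesystem
        "--tmpfs", "/tmp",                      # writable scratch space that is discarded
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--pids-limit", "256",                  # illustrative limits; tune per task
        "--memory", "2g",
        "--cpus", "2",
        image,
        *task_cmd,
    ]


# Hypothetical usage: print the command instead of executing it.
cmd = sandbox_command("python:3.12-slim", ["python", "-c", "print(1)"], run_id="abc")
print(" ".join(cmd))
```

Tasks that need package installs or outbound API calls will not work with `--network none`; in that case a restricted network is a safer substitute than the default bridge.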

Evidence links

github.com

Closest alternatives / related signals

agent-evaluation, rl, benchmarks, harbor, sandbox, reproducibility