Item detail

harbor-framework/harbor

Harbor is an Apache-2.0 framework for running agent evaluations and creating or reusing RL environments. It packages task definitions, datasets, verifiers, and Docker-based sandboxing so teams can run reproducible agent benchmarks and reinforcement-learning training instead of one-off notebooks.

Score: 8.4
Popularity: 78.0
Risk: conditional
Tier: Gold
Score breakdown
Usefulness: 8.0
Novelty: 8.0
Momentum: 9.0
Maturity: 8.2
Open-source/build: 8.4
Evidence: 7.2
Workflow potential: 9.5
Setup ease: 6.4

Popularity is tracked separately. Support, ads, sponsorships, and tips never affect these signals.

Why it matters

Useful for teams that already trust Claude Code, Codex, Aider, or homegrown agents and now need credible, reproducible evals: install Harbor, point it at a small representative task set, and confirm that scoring matches human spot-checks before scaling up RL or CI-gated evals.
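One way to make the "scoring matches human spot-checks" step concrete is to compare automated verdicts with reviewer verdicts on the same small task set. The sketch below is illustrative only: it does not use Harbor's API, and `SpotCheck`, `agreement_report`, and the field names are hypothetical names chosen for this example. It assumes each task has a simple pass/fail verdict from both the verifier and a human.

```python
from dataclasses import dataclass


@dataclass
class SpotCheck:
    task_id: str
    verifier_passed: bool  # automated verdict (hypothetical field name)
    human_passed: bool     # reviewer's verdict on the same run


def agreement_report(checks: list[SpotCheck]) -> dict:
    """Summarize how often the verifier agrees with human review."""
    if not checks:
        raise ValueError("need at least one spot-check")
    agree = sum(c.verifier_passed == c.human_passed for c in checks)
    return {
        "agreement": agree / len(checks),
        # Verifier said pass, human said fail: the riskier kind of error.
        "false_pass": [c.task_id for c in checks
                       if c.verifier_passed and not c.human_passed],
        # Verifier said fail, human said pass.
        "false_fail": [c.task_id for c in checks
                       if not c.verifier_passed and c.human_passed],
    }


if __name__ == "__main__":
    sample = [
        SpotCheck("t1", True, True),
        SpotCheck("t2", True, False),
        SpotCheck("t3", False, False),
        SpotCheck("t4", False, True),
    ]
    print(agreement_report(sample))
```

What agreement level counts as "good enough" is a team decision; listing the disagreeing task IDs is usually more useful than the single ratio, since those are the tasks to inspect before scaling up.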

Who should use it

agent evaluation teams, applied RL researchers, platform teams, agent product engineers

Who should skip it

Skip if the source link, docs, or setup requirements do not match your workflow.

Risk explanation

Docker-based sandboxes still need to be scoped and isolated per evaluation run; scoring logic should be reviewed against human spot-checks before being treated as ground truth.
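As a rough illustration of what "scoped and isolated per evaluation run" can mean at the Docker level, the sketch below builds a `docker run` command with a unique container name and conservative restrictions. This is not how Harbor itself configures its sandboxes; `isolated_run_command`, the `eval-` name prefix, and the specific limits are assumptions for this example. The flags are standard `docker run` options.

```python
import uuid
from typing import Optional


def isolated_run_command(image: str, task_cmd: list[str],
                         run_id: Optional[str] = None) -> list[str]:
    """Build a docker run command for one throwaway evaluation container."""
    run_id = run_id or uuid.uuid4().hex[:12]
    return [
        "docker", "run",
        "--rm",                      # remove the container when the run ends
        "--name", f"eval-{run_id}",  # one uniquely named container per run
        "--network", "none",         # no network access (assumed policy)
        "--read-only",               # read-only root filesystem
        "--memory", "2g",            # illustrative memory limit
        "--pids-limit", "256",       # illustrative process limit
        "--cap-drop", "ALL",         # drop all Linux capabilities
        image, *task_cmd,
    ]


if __name__ == "__main__":
    print(" ".join(isolated_run_command("my-eval-image:latest",
                                        ["python", "run_task.py"])))
```

Agents that need to call external APIs cannot run with `--network none`; in that case the network policy has to be scoped some other way, and the limits above should be tuned to the task set.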

Evidence links

Closest alternatives / related signals

agent-evaluation, rl, benchmarks, harbor, sandbox, reproducibility