FreedomIntelligence/gamecraft-bench: AI tool

Score: 7.5

Popularity: 1.0

Risk: none

Tier: Silver

Score breakdown

Usefulness: 7.0

Novelty: 8.0

Momentum: 6.0

Maturity: 5.5

Open-source/build: 8.4

Evidence: 7.2

Workflow potential: 8.3

Setup ease: 4.2

Popularity is tracked separately. Support, ads, sponsorships, and tips never affect these signals.

Why it matters

Useful for agent-eval teams that want a harder end-to-end benchmark than unit-test patching, especially when they care about whether an agent can coordinate code, scenes, assets, runtime behavior, and playable output as one working system.

Who should use it

Researchers evaluating end-to-end coding-agent capability beyond benchmark patches

Developer-tool teams that need a harder acceptance test for agent orchestration stacks

Builders working with Godot or other interactive app-generation workflows

Readers tracking whether agent benchmarks are getting closer to product-like tasks

Who should skip it

Skip FreedomIntelligence/gamecraft-bench unless the captured evidence suggests it solves a problem you are actively working on.

About this signal

FreedomIntelligence/gamecraft-bench is tracked by RepoRadar as a benchmark in the Agent Evaluation section. It was first seen on 2026-07-01 and last updated on 2026-07-01. The current verdict is 'track' with a Silver tier and advanced setup difficulty. Across RepoRadar's eight signals, FreedomIntelligence/gamecraft-bench is strongest on open-source/build quality (8.4) and workflow potential (8.3) and weakest on setup ease (4.2), a profile worth weighing against your own priorities. This page summarizes the evidence RepoRadar has captured from source metadata. The score, tier, risk label, and verdict on this page are never influenced by sponsorship, ads, or tips; they reflect only the usefulness, popularity, novelty, momentum, maturity, and evidence signals described in the RepoRadar methodology.

How this item is evaluated

RepoRadar assigned FreedomIntelligence/gamecraft-bench a composite score of 7.5 out of 10, placing it in the Silver tier. This score combines weighted sub-signals: usefulness (35%), novelty (18%), momentum (14%), maturity (10%), open-source/build quality (7%), evidence quality (6%), workflow potential (6%), and setup ease (4%). Popularity is tracked separately at 1.0 and never affects the composite score or tier. The risk label of 'none' reflects inherent user-impacting hazards, not generic novelty. Items with no risk flag may still require normal code review before production use.
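As a rough check on the weighting described above, the sketch below combines the sub-scores from the score breakdown using the listed percentages. It assumes the composite is a plain weighted average, which this page does not actually state, and the dictionary keys and the `weighted_average` helper are illustrative names, not part of any RepoRadar API. Under that assumption the result is about 6.97 rather than the published 7.5, so the real methodology presumably includes rounding, normalization, or adjustments not described here.

```python
# Sub-scores and weights copied from this page; key names are illustrative.
SUB_SCORES = {
    "usefulness": 7.0,
    "novelty": 8.0,
    "momentum": 6.0,
    "maturity": 5.5,
    "open_source_build": 8.4,
    "evidence": 7.2,
    "workflow_potential": 8.3,
    "setup_ease": 4.2,
}

# Weights in percent, as listed in the evaluation paragraph.
WEIGHTS = {
    "usefulness": 35,
    "novelty": 18,
    "momentum": 14,
    "maturity": 10,
    "open_source_build": 7,
    "evidence": 6,
    "workflow_potential": 6,
    "setup_ease": 4,
}


def weighted_average(scores, weights):
    """Plain weighted mean (an assumed formula, not RepoRadar's confirmed one)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


composite = weighted_average(SUB_SCORES, WEIGHTS)
strongest = max(SUB_SCORES, key=SUB_SCORES.get)
weakest = min(SUB_SCORES, key=SUB_SCORES.get)

print(sum(WEIGHTS.values()))   # 100, so the listed weights are complete
print(f"{composite:.2f}")      # 6.97 under the plain-average assumption
print(strongest, weakest)      # open_source_build setup_ease
```

Because usefulness carries 35% of the weight, a one-point change there moves this estimate by 0.35, while a one-point change in setup ease moves it by only 0.04.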

Putting this into practice? Read "How to vet an AI agent or MCP server before you wire it in" for the checklist behind this score.

Risk explanation

It is a benchmark and evaluation harness rather than a direct end-user product, so the immediate value is highest for research and tooling teams. Results depend on the verifier and a hidden judging rubric, so treat headline benchmark scores as comparative evidence rather than absolute capability truth.

Evidence links

github.com

Closest alternatives / related signals

benchmark, agent-evaluation, godot, coding-agent, research, apache-2.0