Hello HF!
First time posting here!
Sharing some early results from CHINI-bench, a small public benchmark I finished building a couple of days ago. It asks LLMs to design distributed systems as graphs (components, behaviors, edges), then runs the resulting architecture through a discrete-event simulator under stress scenarios. Scoring is mechanical: no LLM-as-judge and no human in the loop. Same simulator, same math, every model.
This is preliminary: 30 problems, four frontier models, single seed per problem. The N is scoped on purpose. Scoring is fully deterministic given the canvas, so a single run per cell carries real signal. The reason to call it preliminary is the model coverage (only four, all closed) and the single Reflexion turn, not the problem count. Posting now mostly to broaden coverage and pressure-test the framing before making stronger claims.
The setup
- 30 problems across 5 classes (SWE backend, operations, personal, civic, adversarial)
- Models output a `CanvasState` JSON. The simulator scores it on stability, delivery, cost, constraints, and design.
- Open-source CLI: `pip install git+https://github.com/collapseindex/chini-bench-cli` runs any model end-to-end with your own API key
- Harness is hash-pinned (`chini-bench-cli:06d0ffb42f19`) so leaderboard runs are reproducible
Single-shot results so far (4 frontier models, 30 problems each, 120 runs)
- Combined coverage: 10 of 30 problems passed by at least one model
- The other 20 problems weren't passed by any model in this batch
- Roughly: best class is operations (PC2), weakest is adversarial (PC5)
Per-class slices are six problems each, so treat the class-level ordering as directional rather than definitive.
Reflexion track, early observations
I added a second turn: run v1, simulator emits a redacted FeedbackPacket (no scores, just which checks failed), model writes v2, submit v2.
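For anyone who wants to reason about the loop concretely, here is a minimal sketch of one Reflexion turn as described above. It is not the harness code: `reflexion_round`, `ask_model`, `simulate`, the `checks` dictionary, and the shape of `FeedbackPacket` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FeedbackPacket:
    """Hypothetical redacted feedback: names of failed checks only, no scores."""
    failed_checks: List[str]


def reflexion_round(
    prompt: str,
    ask_model: Callable[[str], str],
    simulate: Callable[[str], dict],
) -> str:
    """Run one v1 -> feedback -> v2 turn and return the v2 canvas."""
    v1 = ask_model(prompt)
    result = simulate(v1)  # assumed to return {"checks": {name: passed_bool}, ...}

    # Redaction step: forward only which checks failed, never the scores.
    packet = FeedbackPacket(
        failed_checks=[name for name, ok in result["checks"].items() if not ok]
    )

    revision_prompt = (
        f"{prompt}\n\nYour previous design:\n{v1}\n\n"
        f"Failed checks: {', '.join(packet.failed_checks) or 'none'}\n"
        "Revise the design."
    )
    return ask_model(revision_prompt)
```

The point of the sketch is the redaction: the model sees which checks failed but never a number it could optimize against directly.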
| Model | Avg v1 | Avg v2 | Δ | Passes after revision |
|---|---|---|---|---|
| Gemini 3.1 Pro | ~73 | ~73 | 0 | 2 of 30 |
| Grok 4.20 | ~65 | ~68 | +3 | 1 of 30 |
| GPT-5.4 | ~64 | ~60 | -4 | 0 of 30 |
| Claude Sonnet 4.6 | ~62 | ~53 | -9 | 0 of 30 |
A tentative read:
- Possible overshoot pattern in Claude and GPT runs: feedback flags a failed check, the model restructures more than needed, often adds a component, and ends up tripping a count or constraint limit.
- Possible flat-revision pattern in Gemini runs: starts highest, patches the exact thing the feedback flagged, preserves what worked, but doesn't actually move the average. v2 ≈ v1: the wins and losses cancel out. It lands at the top of the table by virtue of a strong v1, not by improving.
If that pattern holds across more models, it would suggest a search-strategy gap (when to patch vs. when to rewrite) more than a reasoning gap. With four models and a single seed per problem, I'm not ready to call that a finding. It's a hypothesis I'd like to stress-test against open-weights models and longer Reflexion chains.
Net Reflexion v2 passes across the four models in this batch: 3 of 120.
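The arithmetic behind the table and the 3-of-120 figure can be checked with a few lines. The numbers are copied from the table; since the averages are rounded ("~"), the deltas are approximate too. The variable names are just for illustration.

```python
# (avg v1, avg v2, passes after revision), copied from the Reflexion table.
rows = {
    "Gemini 3.1 Pro": (73, 73, 2),
    "Grok 4.20": (65, 68, 1),
    "GPT-5.4": (64, 60, 0),
    "Claude Sonnet 4.6": (62, 53, 0),
}
PROBLEMS = 30  # problems per model

# Delta is simply v2 minus v1 for each model.
deltas = {model: v2 - v1 for model, (v1, v2, _) in rows.items()}

# Net passes and total runs across all four models.
total_passes = sum(passes for _, _, passes in rows.values())
total_runs = PROBLEMS * len(rows)

for model, delta in deltas.items():
    print(f"{model}: {delta:+d}")
print(f"v2 passes: {total_passes} of {total_runs}")
```

The last line prints `v2 passes: 3 of 120`, matching the figure above.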
Caveats I want to be upfront about
- Only four models, all closed-weights. The "frontier" framing is incomplete until open-weights models (Llama, Qwen, DeepSeek, Mistral) are on the board.
- One Reflexion turn only. Multi-turn (2-3 rounds) might tell a different story.
- Single seed per problem. The simulator is deterministic, but model sampling isn't, so seed-level variance isn't characterized.
- The problem set reflects my judgement about what matters in distributed-systems design. Critique welcome on coverage and weighting.
What's open
- All 30 problems and canonical prompts: CHINI-bench - Chinilla
- Methodology and scoring math: Methodology - CHINI-bench
- CLI source (PolyForm Noncommercial): https://github.com/collapseindex/chini-bench-cli (standalone CLI for the CHINI-bench AI system-design benchmark)
- Live leaderboard with the Reflexion track split out: Leaderboard - CHINI-bench
What I'd love help with
- Open-source model runs (Llama, Qwen, DeepSeek, Mistral, anything you have a key or local setup for). The CLI supports Ollama for local and OpenRouter for hosted.
- Pushback on the overshoot/flat-revision framing. Is there a model you'd expect to behave differently? A reading of the data I'm missing?
- Reflexion variants. Do 2-3 turns close the gap, or amplify whichever mode the model started in?
Come check it out! Happy to walk through any of the methodology, scoring weights, or harness details.
A quick note on submission integrity: scoring runs server-side against the canonical problem definitions, so submitters can't ship their own scores or modified rules. Reflexion submissions include the v1 canvas and the server re-scores it; if the self-reported v1 number doesn't match what the simulator actually produces, the row gets flagged for review.
Public CLI runs carry a harness hash (`chini-bench-cli:06d0ffb42f19` for single-shot, `chini-bench-reflex:42769353289d` for Reflexion); anything else is tagged custom. Community submissions are auto-prefixed `community:` so no one can impersonate official model rows.
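To make those rules concrete, here is a rough sketch of the three checks. The two harness hashes come from this post; the function names, the `is_community` flag, and the tolerance in `needs_review` are assumptions, not the server's actual code.

```python
# Harness hashes quoted in this post, mapped to their track.
OFFICIAL_HARNESSES = {
    "chini-bench-cli:06d0ffb42f19": "single-shot",
    "chini-bench-reflex:42769353289d": "reflexion",
}


def harness_tag(harness_hash: str) -> str:
    """Return the official track for a known hash, otherwise 'custom'."""
    return OFFICIAL_HARNESSES.get(harness_hash, "custom")


def display_name(model_name: str, is_community: bool) -> str:
    """Prefix community rows so they can't pass as official model rows."""
    return f"community:{model_name}" if is_community else model_name


def needs_review(reported_v1: float, rescored_v1: float, tol: float = 1e-9) -> bool:
    """Flag a Reflexion row when the self-reported v1 score disagrees
    with the server's re-score. The tolerance is an assumption."""
    return abs(reported_v1 - rescored_v1) > tol
```

For example, `harness_tag("my-fork:abc123")` returns `"custom"`, and `needs_review(62.0, 55.0)` returns `True`.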
- alex