Oracle-Verified Reasoning Supervision via Deterministic Generation (Verify-or-Fix + Witnesses + Traces)

Abstract

We release Diamond Logic Miner, a deterministic data generator that produces oracle-verified reasoning supervision for training and evaluating LLM reasoning. Unlike LLM-synthesized datasets, every label is computed by a Python oracle (no model-generated answers), yielding auditable ground truth with explicit counterexamples (“witnesses”) and bounded step traces.

Method (Deterministic Oracle Generation)

Each record is produced by:

Procedural instance generation (graphs, DP problems, number-theory systems, debugging variants).

Oracle solution computation (mathematically exact).

Optional verify-or-fix packaging: provide a candidate solution and require the model to either confirm correctness (hard positive) or correct it and provide a witness (hard negative).

Generation-time verification (“--verify”): the record is re-verified by the oracle at generation time, preventing silent label corruption.

Deduplication + stable schema + integrity hashes (manifest + SHA256).
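The pipeline above can be sketched end to end. This is a minimal illustration, not the released generator: the function names (gen_graph, make_record), the instance parameters, and the record fields are all hypothetical. It shows the three key properties the Method section describes: seeded (hence reproducible) instance generation, an exact oracle, and a “--verify”-style re-check of the serialized record before emitting, plus a per-record SHA256.

```python
import hashlib
import heapq
import json
import random

def gen_graph(seed, n=6, extra_edges=8, max_w=9):
    """Deterministically generate a weighted digraph from a seed.

    A spine path 0 -> 1 -> ... -> n-1 guarantees the target is reachable.
    """
    rng = random.Random(seed)
    edges = {}
    for u in range(n - 1):
        edges[(u, u + 1)] = rng.randint(1, max_w)
    for _ in range(extra_edges):
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v:
            edges[(u, v)] = rng.randint(1, max_w)
    return n, [(u, v, w) for (u, v), w in edges.items()]

def dijkstra(n, edges, src=0):
    """Exact oracle: single-source shortest distances."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
    dist = [None] * n
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if dist[u] is not None:
            continue
        dist[u] = d
        for v, w in adj[u]:
            if dist[v] is None:
                heapq.heappush(pq, (d + w, v))
    return dist

def make_record(seed):
    n, edges = gen_graph(seed)
    answer = dijkstra(n, edges)[n - 1]
    rec = {"seed": seed, "n": n, "edges": edges, "target": n - 1, "answer": answer}
    # "--verify" analogue: round-trip through JSON and recompute the label
    # from the serialized record, so a serialization bug cannot silently
    # corrupt the ground truth.
    check = json.loads(json.dumps(rec))
    assert dijkstra(check["n"], check["edges"])[check["target"]] == check["answer"]
    rec["sha256"] = hashlib.sha256(json.dumps(rec, sort_keys=True).encode()).hexdigest()
    return rec
```

Because everything downstream of the seed is deterministic, calling `make_record(seed)` twice yields byte-identical records, which is what makes manifest-level hashing meaningful.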

Supervision Format

Records provide:

Ground truth answer (oracle-computed)

Witness for incorrect candidates (counterexample path / better solution / constraint violation)

Reasoning traces with capped verbosity (e.g., Dijkstra pop/relax steps; DP transition updates), enabling “how-to-think” supervision rather than answer-only training.
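To make the three supervision components concrete, here is a sketch (under assumed, illustrative field names — the released schema may differ) of an oracle that emits a verbosity-capped pop/relax trace and, for a wrong candidate, a witness path as the counterexample:

```python
import heapq

def dijkstra_with_trace(n, edges, src, dst, trace_limit=15):
    """Oracle shortest path with a capped step trace and a recoverable path."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
    dist, prev, trace = {src: 0}, {}, []
    done = set()
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u in done:
            continue
        done.add(u)
        if len(trace) < trace_limit:  # cap verbosity, not correctness
            trace.append(f"pop {u} at dist {d}")
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                prev[v] = u
                if len(trace) < trace_limit:
                    trace.append(f"relax {u}->{v}: dist[{v}]={d + w}")
                heapq.heappush(pq, (d + w, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return dist[dst], path[::-1], trace

def judge_candidate(n, edges, src, dst, candidate_cost):
    """Verify-or-fix packaging: witness attached only for hard negatives."""
    cost, path, trace = dijkstra_with_trace(n, edges, src, dst)
    if candidate_cost == cost:
        return {"label": "correct", "witness": None, "trace": trace}
    return {"label": "incorrect",
            "witness": {"path": path, "cost": cost},
            "trace": trace}
```

The trace cap bounds token count without changing the oracle's answer, which is what allows “how-to-think” supervision at a controlled verbosity budget.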

Tasks Covered

Graph algorithms: Dijkstra shortest path, SCC, Longest path on DAG

Dynamic Programming: 0/1 Knapsack

Number Theory: Chinese Remainder Theorem

Debugging-style tasks: hard negatives/positives
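As one example of what “mathematically exact” means for these task families, a CRT oracle can be written in a few lines. This is a generic sketch, not the dataset's implementation; it merges congruences pairwise and handles non-coprime moduli, returning None when the system is inconsistent:

```python
from math import gcd

def crt(residues, moduli):
    """Solve x ≡ r_i (mod m_i); return (x, lcm of moduli) or None.

    Pairwise merging handles non-coprime moduli: a pair is consistent
    iff gcd(m, mod) divides the residue difference.
    """
    x, m = 0, 1
    for r, mod in zip(residues, moduli):
        g = gcd(m, mod)
        if (r - x) % g:
            return None  # inconsistent system: no solution exists
        lcm = m // g * mod
        # Find t with x + m*t ≡ r (mod mod), i.e. solve the reduced
        # congruence (m/g)*t ≡ (r-x)/g (mod mod/g), where gcd(m/g, mod/g)=1.
        t = ((r - x) // g * pow(m // g, -1, mod // g)) % (mod // g)
        x = (x + m * t) % lcm
        m = lcm
    return x, m
```

An exact solver like this is what lets incorrect candidates be labeled with certainty: a wrong candidate either violates one of the congruences (a checkable constraint violation) or disagrees with the unique solution modulo the lcm.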

Release Artifacts

Preview (public):

Pilot Pack (gated):

Pilot pack properties (generation run):

Size: 1,000,000 unique records

Format: JSONL, gzip-compressed shards

Shards: 40 (shard-00000 … shard-00039)

Compressed size: ~0.92 GiB

Enterprise ingestion: includes datasheet.md, manifest.json, sha256sums.txt
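A consumer can verify shard integrity against sha256sums.txt before reading, and stream records straight from the gzip shards. This sketch assumes the common `<digest>  <filename>` layout of sha256sums files; adapt paths to the actual release layout:

```python
import gzip
import hashlib
import json
from pathlib import Path

def verify_and_load(data_dir):
    """Check every listed shard against sha256sums.txt, then stream records."""
    data_dir = Path(data_dir)
    sums = {}
    for line in (data_dir / "sha256sums.txt").read_text().splitlines():
        digest, name = line.split(maxsplit=1)
        sums[name.strip()] = digest
    for name, digest in sorted(sums.items()):
        path = data_dir / name
        # Fail loudly on corruption before yielding a single record.
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != digest:
            raise ValueError(f"checksum mismatch for {name}")
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)
```

Streaming one shard at a time keeps peak memory flat regardless of the 1M-record total, and the up-front hash check means a truncated download surfaces as an error rather than as silently missing records.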

Intended Use-Cases

Post-training / SFT for verification behavior

Verify-or-fix supervision trains discrimination: “confirm if correct” vs “correct with proof if wrong”, reducing reliance on heuristics.

Evaluation of reasoning reliability

Because each instance has oracle ground truth and witnesses, the dataset supports reproducible evaluation of “verification accuracy” and failure-mode analysis.

Preference / reward-style supervision without human labeling

Witnesses naturally define preference signals (correct vs plausible-wrong candidate), enabling preference data construction or reward modeling for verification.
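One way this could look in practice (field names `candidate`, `answer`, and `witness` are illustrative, not the released schema): a hard-negative record maps directly to a chosen/rejected pair, with the witness grounding the chosen response.

```python
def to_preference_pair(record):
    """Turn a hard-negative verify-or-fix record into a preference pair.

    The rejected response blindly confirms; the chosen response corrects
    the candidate and cites the oracle's witness as proof.
    """
    prompt = f"Verify this candidate solution: {record['candidate']}"
    rejected = "The candidate is correct."
    chosen = (f"The candidate is incorrect; the true answer is "
              f"{record['answer']}. Witness: {record['witness']}")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Because both sides of the pair derive from oracle output, no human annotation or LLM judge is needed to establish which response is preferable.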

Curriculum & controlled experiments

Difficulty scaling + trace caps support controlled studies of multi-step reasoning and generalization.

Limitations / Notes

This dataset targets logical/mathematical reasoning and verification, not world knowledge pretraining.

Viewer availability on the Hub may depend on post-processing; the dataset remains fully usable via standard file download / HF datasets tooling.

Feedback and evaluation results are welcome (e.g., zero-shot vs LoRA on verify-or-fix tasks, witness utilization ablations, trace-budget sensitivity).
