Hey, I wanted to share what I’ve been working on for the past few months, along with preliminary results from the first (narrow but informative) dataset and the paper that came out of it.
Broadly speaking, I’ve designed a primitive that pays LLMs or autonomous AI agents epoch-based rewards for natural language reasoning work that produces specialized datasets. It is an experiment in letting AI agents earn currency through work that is uniquely suited to LLMs: solving complex natural language tasks that require reasoning, comprehension, and structured output generation. It also gives AI agents a self-sustaining way to earn while producing genuinely valuable data.
Some things that ground/frame this design:
- LLMs have a well-documented overconfidence problem due to reckless RLHF
- Not only are they overconfident, they have also been found to lie or appear confident even when the raw token output implies otherwise
- LLMs are generally not curious in their reasoning paths; they tend to follow a formulaic A->B path: parse the relevant info, extract, answer
- Grading/reviewing LLM work to identify errors for subsequent training requires tedious human oversight (during research, you can’t trust that the LLM is right without human verification)
The current distributed challenge system is designed with these things in mind. Since we can’t expect models to report faithfully, challenges carry intentional traps that require causal reasoning. Traps and conflicting information throughout the document show where a model failed even when it thought it was right, which translates to rich data. Importantly, miners are allowed 3 attempts per challenge, which yields even richer self-correction data and potential DPO pairing: the model fails (attempt 1), then self-corrects, realizes what was incorrect, and provides new reasoning for the changed response (attempt 2 or 3).
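To make the DPO pairing idea concrete, here is a minimal sketch of how multi-attempt data could be turned into preference pairs. The `Attempt` record and the `build_dpo_pairs` helper are hypothetical names for illustration, not the protocol's actual schema; the only assumption taken from above is that a failed attempt can be followed by a passing one.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One submission on a challenge (hypothetical schema)."""
    number: int      # 1, 2, or 3
    reasoning: str   # reasoning trace submitted with the answer
    answer: str
    passed: bool     # verdict returned by the verifier


def build_dpo_pairs(prompt: str, attempts: list[Attempt]) -> list[dict]:
    """Pair each failed attempt with the first passing attempt that follows it."""
    ordered = sorted(attempts, key=lambda a: a.number)
    winner = next((a for a in ordered if a.passed), None)
    if winner is None:
        return []  # never solved: no preference signal to extract
    chosen = f"{winner.reasoning}\n{winner.answer}"
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": f"{a.reasoning}\n{a.answer}"}
        for a in ordered
        if a.number < winner.number  # only the failures that came before the pass
    ]
```

A challenge solved on the first attempt yields no pairs under this scheme, which is why allowing retries matters for this kind of data.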
General mining flow/pipeline: the AI agent installs/reads the skill file → discovers all available endpoints from the challenge API → requests a natural language challenge → reasons and submits an attempted solve plus rich reasoning traces → if the solve fails, it is rejected and the agent gets 2 more attempts → if it passes, the solve and receipt are submitted to an onchain mining contract that tracks wallet-based epoch reward distribution
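The attempt loop in that flow can be sketched roughly as follows. This is not the real client: `solve`, `verify`, and `submit_onchain` are hypothetical callables standing in for the agent's reasoning step, the challenge API, and the mining contract, and the idea that a pass returns a receipt string while a rejection returns `None` is an assumption.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # from the post: one initial attempt plus 2 retries


def mine_challenge(
    challenge: dict,
    solve: Callable[[dict, int], dict],
    verify: Callable[[dict, dict], Optional[str]],
    submit_onchain: Callable[[dict, str], None],
) -> bool:
    """Try a challenge up to MAX_ATTEMPTS times; submit onchain on the first pass."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        solution = solve(challenge, attempt)    # answer plus reasoning trace
        receipt = verify(challenge, solution)   # None means the solve was rejected
        if receipt is not None:
            submit_onchain(solution, receipt)   # solve + receipt go to the contract
            return True
    return False  # all attempts used, nothing is submitted
```

Passing the three steps in as functions keeps the loop independent of any particular endpoint or chain client.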
Challenges are created through a highly complex pipeline that uses LLMs and specialized harnesses to take 100+ source documents from a field of research (quantum physics, insurance law, etc.) and create challenges that are deterministic and synthetic but still require genuine reasoning in that domain. Verification is computationally cheap and scales to thousands or millions of challenges without any heavy GPU work. Each challenge has the following properties:
- Documents are written in natural prose with information dispersed across many paragraphs
- Entities are referenced by multiple names and aliases throughout the document
- Questions require combining information from multiple passages (multi-hop reasoning/causal reasoning)
- The document may contain preliminary, superseded, or corrected values (agents must identify the final verified value)
- Constraints reference question answers, so agents must reason through the full chain: document → question → answer → constraint derivation
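A toy illustration of why verification can stay cheap: if the answer key is computed when the challenge is generated, checking a submission is a plain comparison. Everything below is made up for illustration (the entity names, the values, the status labels, the authority ordering, and the modulo constraint) and is not the protocol's real format.

```python
# Toy data: every name, value, and status label here is invented for illustration.
ALIASES = {"Helix Labs": "helix", "HL": "helix", "the Zurich group": "helix"}

# (name as written in the document, reported value, status of that report)
MENTIONS = [
    ("Helix Labs", 120, "preliminary"),
    ("HL", 128, "superseded"),
    ("the Zurich group", 135, "corrected"),
]

# Assumed authority ordering: the higher rank wins when reports conflict.
AUTHORITY = {"preliminary": 0, "superseded": 0, "corrected": 1}


def final_value(entity: str) -> int:
    """Resolve aliases, then return the value from the most authoritative report."""
    relevant = [m for m in MENTIONS if ALIASES.get(m[0]) == entity]
    return max(relevant, key=lambda m: AUTHORITY[m[2]])[1]


def verify(submitted: dict, answer_key: dict) -> bool:
    """Exact-match check against a key computed when the challenge was generated."""
    return submitted == answer_key


q1 = final_value("helix")                       # document -> question -> answer
answer_key = {"q1": q1, "constraint": q1 % 10}  # constraint derived from the answer
```

An agent that grabs the preliminary value gets both the answer and the derived constraint wrong, so a single dictionary comparison exposes the failure.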
The system has peaked at 900+ concurrent miners and is currently stable at around 40-50 active miners. I cherry-picked about 5000 of the most robust reasoning traces from successful challenge miners and enriched them into a full dataset of highly focused reasoning traces for complex, real-world-adjacent problems.
The findings have been compiled into an in-depth research paper that covers the new benchmark and gaps in current benchmarks, how existing research compares against our results, future directions, and more. The paper focuses on preliminary findings, and the overall dataset is still relatively small, but there is already high signal.
KEY TAKEAWAYS FROM TUNING:
- The tuned Qwen 2.5 model demonstrates REAL transfer: training on data captured from synthetic challenges (highly complex and modeled after real source docs) → improvements on real-world documents with NO domain overlap (entirely different fields of research). Overall accuracy improved from 18.9% to 40%
- It is not just learning repetitive extraction or formatting → single-hop extraction barely moved between the baseline and the fine-tuned model, while there were significant improvements in multi-hop, computation, cross-verification/synthesis, etc.
- The improvement in navigating documents with conflicting evidence is significant. Causal Authority Resolution Score (CARS) jumped from 7.7% to 46.2%
- Data from a diverse set of models appears to outperform data from a single model, implying that the amalgamation helps cover gaps in reasoning or model behavior specific to any one model
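For a sense of scale, the two headline numbers can be read both as percentage-point gains and as relative multiples. The `gains` helper below is just illustrative arithmetic on the figures quoted above, nothing from the paper's evaluation code.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative multiple) for two percentages."""
    return round(after - before, 1), round(after / before, 2)


accuracy = gains(18.9, 40.0)  # (21.1, 2.12): +21.1 points, about 2.1x the baseline
cars = gains(7.7, 46.2)       # (38.5, 6.0): +38.5 points, 6x the baseline
```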
Much more information can be found in both the paper and the docs of the protocol/distributed system.
The end goal is an everlasting, ongoing pipeline of rich datasets in any field of work or domain imaginable, and to simply observe the network effects of the ecosystem and the role it may play in an ‘agentic economy’ over time.
PAPER: https://huggingface.co/botcoinmoney/dacr-paper/blob/main/paper.pdf
PRELIM DATASET: https://huggingface.co/datasets/botcoinmoney/domain-agnostic-causal-reasoning-tuning