Building on Anthropic’s Circuit Tracer, Neuronpedia, and Circuit Tracing (Lindsey et al., 2025), we extend the paradigm to enable recursive self-interpretation, in which models continuously monitor, trace, and explain their own decision processes. The results are presented as interactive artifacts hosted on each frontier AI system.
1. Core Recursive Attribution Architecture
The framework below establishes a systematic approach to making the internal processes of Claude and other frontier AI systems more transparent and analyzable, in support of Anthropic’s circuit tracing research.
framework:
  name: "recursive_attribution_framework"
  version: "1.0.0"
  alignment: "circuit_tracing_research"
  core_principles:
    - "Expose computational pathways through structured attribution"
    - "Enable feature intervention for causal confirmation"
    - "Provide multi-level analysis from tokens to concepts"
    - "Support cross-model and cross-language comparison"
    - "Make reasoning faithfulness empirically verifiable"
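The configuration above could be mirrored in code so that a consumer can check it before use. The following is a minimal sketch, not part of the framework itself: the `FrameworkConfig` class, the `validate` helper, and the specific checks (non-empty name, `MAJOR.MINOR.PATCH` version, at least one principle) are hypothetical and assume the keys are nested under `framework:` as shown.

```python
import re
from dataclasses import dataclass

# Hypothetical in-memory mirror of the `framework:` block; field names follow the config keys.
@dataclass(frozen=True)
class FrameworkConfig:
    name: str
    version: str
    alignment: str
    core_principles: tuple = ()

# Assumed convention: the version string is three dot-separated integers.
VERSION_PATTERN = re.compile(r"^\d+\.\d+\.\d+$")

def validate(cfg: FrameworkConfig) -> list:
    """Return a list of problems found; an empty list means no problems were detected."""
    problems = []
    if not cfg.name:
        problems.append("name must be non-empty")
    if not VERSION_PATTERN.match(cfg.version):
        problems.append(f"version {cfg.version!r} is not MAJOR.MINOR.PATCH")
    if not cfg.core_principles:
        problems.append("core_principles must list at least one principle")
    return problems

# Values copied from the configuration above.
CONFIG = FrameworkConfig(
    name="recursive_attribution_framework",
    version="1.0.0",
    alignment="circuit_tracing_research",
    core_principles=(
        "Expose computational pathways through structured attribution",
        "Enable feature intervention for causal confirmation",
        "Provide multi-level analysis from tokens to concepts",
        "Support cross-model and cross-language comparison",
        "Make reasoning faithfulness empirically verifiable",
    ),
)

if __name__ == "__main__":
    issues = validate(CONFIG)
    print("config OK" if not issues else "\n".join(issues))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass; a real loader would likely parse the YAML file directly instead of hard-coding the values.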