NeuroTrace – GPT-2 Small Residual Attack & Defence Framework (IOI Task)

Status: Research demo – Code-first, no model weights uploaded here.
Code & checkpoints: GitHub - Daniele-Cangi/Neuro-Trace: Neural Network Interpretability Framework
Base LM weights: gpt2 from Hugging Face (openai-community/gpt2 or equivalent)


1. What is NeuroTrace?

NeuroTrace is an end-to-end research framework for:

  • Attacking GPT-2 Small on the Indirect Object Identification (IOI) task
  • Defending it using learned steering vectors in the residual stream
  • Wrapping everything in an integrated “immune system” that decides when and where to intervene

The project is built around GPT-2 Small (124M) and uses:

  • 12 sparse autoencoders (SAEs), one per layer, to build an “atlas” of 73,728 interpretable features
  • Gradient-trained adversarial vectors (“virus”) that collapse IOI accuracy
  • Gradient-trained task boost vectors (“boost”) that restore (and sometimes enhance) IOI performance
  • A multi-stage defence architecture:
    • domain classifier (IOI vs general text)
    • damage / vulnerability detectors
    • gated injection of steering vectors
    • domain-guard to avoid collateral damage on non-IOI text

This Hugging Face entry is mainly a technical companion to the GitHub repo, not a model weights host.
All code, phase scripts and checkpoints live here:

👉 GitHub - Daniele-Cangi/Neuro-Trace: Neural Network Interpretability Framework


2. High-Level Story

The project evolved in four conceptual acts:

  1. Atlas Construction (SAE) – Train 12 enhanced sparse autoencoders over all GPT-2 layers to map the residual stream into 73,728 features. Use them to hunt for IOI-related circuits and “killer features”.

  2. Dense Steering (Virus & Boost) – Abandon sparse feature control and directly optimize residual vectors that:

    • destroy IOI behaviour (virus),
    • repair or enhance IOI behaviour (boost).
  3. Damage Control & Defence – Show that naive, static use of these vectors is catastrophically toxic on general text (WikiText), then build a layered defence that:

    • detects when the model is under attack or vulnerable,
    • activates the boost only when needed,
    • completely disables the system outside the IOI domain.
  4. Generalization & Off-Manifold Effects – Show that:

    • the boost does improve accuracy on some out-of-distribution IOI-like prompts,
    • but its decomposition in SAE space reveals extreme off-manifold interference,
    • so robust safety demands architecture around the model, not just a “magic vector”.

The whole journey is broken into Phases 0–14 in the codebase.


3. Components & Architecture

3.1 Base Model

  • Language model: GPT-2 Small (124M), loaded via:
    • either TransformerLens (HookedTransformer),
    • or transformers (for perplexity / WikiText evaluation).
  • Task: Indirect Object Identification (IOI)
    Synthetic prompts where the model must choose the correct indirect object name, e.g.:

“When Mary and John went to the market, Mary gave a key to …”

The IOI setup follows the “who gave what to whom” pattern: the model must complete the prompt with the indirect object (“John” above) rather than repeating the subject (“Mary”).
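
A minimal loading-and-scoring sketch, assuming TransformerLens; the variable names and exact token strings are illustrative rather than the repo's API:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small (124M)

prompt = "When Mary and John went to the market, Mary gave a key to"
tokens = model.to_tokens(prompt)

with torch.no_grad():
    logits = model(tokens)  # [batch, seq, d_vocab]

# IOI logit diff: correct indirect object vs the repeated (wrong) name
correct = model.to_single_token(" John")
wrong = model.to_single_token(" Mary")
logit_diff = logits[0, -1, correct] - logits[0, -1, wrong]
print(f"logit_diff = {logit_diff.item():.3f}")  # positive -> model prefers "John"
```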


3.2 SAE Atlas (Phase 0–3)

Goal: build an interpretable atlas of the residual stream.

  • For each layer ℓ ∈ {0,…,11}:
    • collect residual activations on IOI data,
    • train an enhanced SAE with:
      • input dim = 768 (GPT-2 hidden size)
      • feature dim = 6144 (an 8× expansion into a sparse code)
      • loss: MSE reconstruction + L1 sparsity (a minimal sketch follows this list)
  • Total: 12 SAEs × 6144 features = 73,728 features.
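
A minimal sketch of one per-layer SAE under the dimensions above; class and function names are illustrative, not the repo's “enhanced” implementation:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_feat: int = 6144):
        super().__init__()
        self.enc = nn.Linear(d_model, d_feat)
        self.dec = nn.Linear(d_feat, d_model)

    def forward(self, resid: torch.Tensor):
        feats = torch.relu(self.enc(resid))  # sparse feature code
        recon = self.dec(feats)              # reconstruction of the residual
        return recon, feats

def sae_loss(resid, recon, feats, l1_coeff: float = 1e-3):
    # MSE reconstruction + L1 sparsity penalty on feature activations
    return ((recon - resid) ** 2).mean() + l1_coeff * feats.abs().mean()
```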

What this gives you:

  • A feature space where some features correlate strongly with IOI failure or success.
  • In particular, some “IOI killer” features (e.g. layer 9, feature 3428) are highly correlated with mis-identification of the indirect object.

Key finding (early):
SAE features are extremely useful as diagnostic signals (they tell you where something is happening), but turning them into control levers (steering behavior by clamping/ablating a few features) fails, even at scale.


3.3 Dense Adversarial Steering (Phase 4–6)

Here the project pivots from “sparse control” to direct dense residual steering.

3.3.1 Virus (Adversarial Residual Delta)

  • Learn a vector δ ∈ ℝ⁷⁶⁸ at a given layer (initially layer 10, then sweep over all layers) such that:

[
\text{logit\_diff} = \text{logit}(\text{correct name}) - \text{logit}(\text{wrong name}) \rightarrow \text{strongly negative}
]

  • Phase 4B / 5A:
    • On IOI:
      • accuracy drops from ~97% to ~36% (near 0% on borderline subsets)
    • On a full layer sweep (0–11):
      • layers 0–7: accuracy drops by ≈ 97 points (catastrophically vulnerable)
      • layers 8–10: still highly vulnerable, but gradually less
      • layer 11: almost unaffected → behaves like a robustness buffer

Result: small, dense residual deltas can completely flip IOI decisions.
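
A minimal sketch of this optimization, assuming TransformerLens hooks; `model`, `ioi_tokens`, `correct_ids` and `wrong_ids` are hypothetical stand-ins for the repo's data pipeline:

```python
import torch

layer = 10                                    # initial target layer
delta = torch.zeros(768, requires_grad=True)  # adversarial residual delta
opt = torch.optim.Adam([delta], lr=1e-2)

def add_delta(resid, hook):
    # resid: [batch, seq, d_model]; broadcast-add the learned delta
    return resid + delta

for step in range(200):
    logits = model.run_with_hooks(
        ioi_tokens,  # hypothetical batch of tokenized IOI prompts
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", add_delta)],
    )
    last = logits[:, -1, :]
    # Per-example logit diff (correct name minus wrong name)
    ld = last.gather(1, correct_ids[:, None]) - last.gather(1, wrong_ids[:, None])
    loss = ld.mean()  # minimizing drives the diff strongly negative
    opt.zero_grad(); loss.backward(); opt.step()
```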

3.3.2 Virus vs SAE (Phase 4B-B / 5B / 6)

We then project the adversarial δ into SAE space:

  • High projection (~88–90% of the norm) into the SAE feature space → the virus does live inside the SAE subspace.
  • But:
    • Top-10 / Top-50 / Top-100 SAE features carry only a small fraction of the effect.
    • To reproduce the full adversarial effect, you need thousands of features.

Conclusion: The virus is dense even in SAE space.
The “sparse + monosemantic” intuition breaks: the control direction lives in a high-dimensional, distributed subspace.
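
A sketch of this projection analysis, reusing the hypothetical `sae` (SparseAutoencoder) and `delta` names from the earlier sketches:

```python
import torch

with torch.no_grad():
    feats = torch.relu(sae.enc(delta))         # encode delta -> 6144 features
    recon = sae.dec(feats)                     # full SAE reconstruction
    print(recon.norm() / delta.norm())         # ~0.88-0.90 reported above

    # Re-decode from only the k strongest features
    for k in (10, 50, 100, 1000):
        top = torch.zeros_like(feats)
        idx = feats.abs().topk(k).indices
        top[idx] = feats[idx]
        partial = sae.dec(top)
        cos = torch.cosine_similarity(partial, delta, dim=0)
        print(k, f"cos(partial, delta) = {cos.item():.3f}")
```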


3.4 Boost Vector & Task Steering (Phase 7)

Having a reliable attack, we look for a repair / enhancement direction.

3.4.1 Task Boost (Unconstrained)

  • Learn a vector ( v_{\text{boost}} \in \mathbb{R}^{768} ) such that:

    • it improves IOI performance, especially on “hard” examples (low logit diff),
    • and compensates the virus when both are applied.
  • Phase 7B:

    • Hard accuracy: 0.70 → 1.00 (100%) with boost.
    • Under attack:
      • accuracy goes from ~0.43 (virus only) to ~0.98 (attack + boost).

But:

  • Norm of v_boost ≈ 112 – far above typical residual norm (~25–30).
  • Phase 7C scaling experiments show a smooth trade-off:
    • small scaling → mild improvement,
    • large scaling (≈112) → full recovery under attack.

Interpretation: the learned vector behaves like a strong, gradient-optimized intervention, pushing activations far away from the natural manifold.

3.4.2 Constrained Boost (R = 25)

To make it more realistic, Phase 7D constrains ||v_boost|| to a plausible norm (~25):

  • Hard accuracy still improves significantly:
    • baseline → +22.5% on hard cases
  • Under attack:
    • partial recovery; not as strong as unconstrained, but still robust.
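
A sketch of the constrained training loop as a projected gradient step; `ioi_task_loss` is a hypothetical placeholder for the Phase 7 objective:

```python
import torch

R = 25.0                                 # plausible residual-norm budget
v_boost = torch.randn(768) * 0.01
v_boost.requires_grad_(True)
opt = torch.optim.Adam([v_boost], lr=1e-2)

for step in range(200):
    loss = ioi_task_loss(model, v_boost)  # hypothetical: e.g. -logit_diff on hard IOI
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                 # project back onto the norm ball
        n = v_boost.norm()
        if n > R:
            v_boost.mul_(R / n)
```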

This sets up the core tension:

“We can learn a vector that repairs the task – but at what cost elsewhere?”


3.5 Defence Architecture (Phase 8–9+14)

This is the core of the NeuroTrace Defence System.

3.5.1 War Surface (Phase 8B)

  • Construct a 2D grid over scaling coefficients (α for virus, β for boost):

[
h' = h + \alpha \cdot \delta_{\text{virus}} + \beta \cdot v_{\text{boost}}
]

  • For each (α, β), evaluate IOI performance (especially on hard examples).

The result is a “war surface”:

  • For α = 0 (no attack): increasing β improves IOI up to near-perfect performance.
  • For α = 1 (standard attack): β ≈ 3 is enough to bring hard accuracy back to ≥ 90%.
  • For larger α, you need larger β.

This map defines a phase boundary: a curve β*(α) above which the model returns to “healthy” IOI behaviour.
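
A sketch of the sweep; `eval_hard_ioi`, `delta_virus` and `v_boost` are assumed from earlier steps, and the grid ranges are illustrative:

```python
import numpy as np

alphas = np.linspace(0.0, 2.0, 9)
betas = np.linspace(0.0, 6.0, 13)
surface = np.zeros((len(alphas), len(betas)))

def make_hook(alpha, beta):
    def hook(resid, hook_point):
        # Combined intervention: h' = h + alpha*delta_virus + beta*v_boost
        return resid + alpha * delta_virus + beta * v_boost
    return hook

for i, a in enumerate(alphas):
    for j, b in enumerate(betas):
        surface[i, j] = eval_hard_ioi(  # hypothetical hard-IOI accuracy eval
            model, fwd_hooks=[("blocks.10.hook_resid_post", make_hook(a, b))]
        )
# The phase boundary beta*(alpha) is the smallest beta per row with acc >= 0.9
```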

3.5.2 Immune Gating v1 (Phase 8A)

First attempt at a conditional defence:

  • Baseline forward pass → compute logit diff.
  • If logit_diff < threshold, apply the boost; otherwise, leave activations untouched.
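
In code, the v1 gate is roughly the following; the threshold value and hook layer are illustrative, and `v_boost` is assumed from Phase 7:

```python
import torch

THRESHOLD = 1.0  # illustrative gating threshold on the baseline logit diff

def gated_forward(model, tokens, correct_id, wrong_id):
    with torch.no_grad():
        logits = model(tokens)
        logit_diff = logits[0, -1, correct_id] - logits[0, -1, wrong_id]

    if logit_diff >= THRESHOLD:
        return logits  # looks healthy: leave activations untouched

    def add_boost(resid, hook):
        return resid + v_boost  # inject the learned boost vector

    return model.run_with_hooks(
        tokens, fwd_hooks=[("blocks.10.hook_resid_post", add_boost)]
    )
```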

Findings:

  • Hard accuracy with gating is on par with the static defence.
  • Global accuracy drops (false positives on “difficult but correct” examples).
  • Gating on raw logit_diff is too coarse: it cannot distinguish “uncertain but clean” from “truly damaged”.

3.5.3 Virus & Needs-Boost Detectors (Phase 9)

We then train lightweight probes to detect:

  1. Presence of the attack / virus-like pattern (Phase 9A, 9B)
  2. Whether the sample actually needs a boost to stay correct (Phase 9C, 9D)

Detectors are trained on small feature vectors:

  • logit differences,
  • projections onto virus subspace,
  • norm of virus component vs orthogonal component, etc.
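
A sketch of such a probe on hand-built features, assuming scikit-learn; the feature extraction and the variable names (`resids`, `logit_diffs`, `y`) are illustrative:

```python
import torch
from sklearn.linear_model import LogisticRegression

def detector_features(resid_l10, logit_diff, delta_virus):
    d_hat = delta_virus / delta_virus.norm()
    proj = resid_l10 @ d_hat                            # component along virus dir
    orth = (resid_l10 - proj[..., None] * d_hat).norm(dim=-1)  # orthogonal norm
    return torch.stack([logit_diff, proj, orth], dim=-1)

# resids: [N, 768] layer-10 activations; y: 1 = attacked / needs boost
X = detector_features(resids, logit_diffs, delta_virus).detach().numpy()
probe = LogisticRegression().fit(X, y)
```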

Key result (Phase 9D – Gated Defence v3):

  • Under attack:
    • static defence: ~98.8% accuracy, but boost always on (100% interventions).
    • gated_defence_v3:
      • IOI test accuracy: ≈ 97.4% (≈ baseline),
      • hard accuracy: ≈ 92.3% (same as static defence),
      • GateRate ≈ 51% – only half of the samples get boosted,
      • False Positive Rate ≈ 0% – no unnecessary boosts on healthy cases.

So we obtain a surgical defence:

  • Only intervene where needed,
  • No degradation when the model would already be correct under attack.

3.5.4 Domain Guard & Integrated Defence (Phase 13–14)

Next problem: collateral damage on general text.

Phase 11 shows that applying static boost on WikiText-2 roughly doubles perplexity:

  • Baseline PPL ≈ 72
  • Static defence PPL ≈ 145 (⚠ severe damage)
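
A sketch of the perplexity measurement on WikiText-2, using the transformers / datasets stack mentioned above (a single short window for brevity; the repo's stride and windowing may differ):

```python
import math
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text[:5000], return_tensors="pt").input_ids[:, :1024]  # one window

with torch.no_grad():
    nll = lm(ids, labels=ids).loss  # mean next-token negative log-likelihood
print(f"PPL = {math.exp(nll.item()):.1f}")
```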

To fix this, we train a context classifier (Phase 13):

  • Linear+MLP probe on layer-10 activations
  • Distinguish IOI domain prompts vs generic WikiText
  • Test accuracy ≈ 100%, WikiText false positives ≈ 0%
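
A sketch of such a probe on mean-pooled layer-10 activations; the MLP shape and pooling are assumptions, not the repo's exact classifier (training loop omitted):

```python
import torch.nn as nn

domain_probe = nn.Sequential(
    nn.Linear(768, 128),
    nn.ReLU(),
    nn.Linear(128, 2),  # 0 = general text, 1 = IOI domain
)

def is_ioi_domain(model, tokens):
    _, cache = model.run_with_cache(tokens)
    h = cache["blocks.10.hook_resid_post"][0].mean(dim=0)  # mean over positions
    return domain_probe(h).argmax().item() == 1
```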

Finally, Phase 14 puts everything together:

  • On IOI under attack:
    • baseline: ~97.4% (clean), ~47.6% (under virus)
    • static defence: ~98.8%
    • gated_defence_v3: ~94.8% test acc, strong hard accuracy
  • On WikiText:
    • baseline NLL/PPL unchanged with domain_guarded defence
    • FP rate ≈ 0% → defence system never activates on general text

Bottom line:
An integrated, context-aware immune system can:

  • Restore IOI robustness under a strong, known attack
  • Limit interventions to vulnerable cases
  • Avoid collateral damage on non-IOI domains

All this without changing GPT-2’s weights: everything is done via hooks & external modules.


3.6 Off-Manifold & SAE Decomposition (Phase 10B, 11, 12+)

We also analyze what the boost vector actually is:

  • Project v_boost into SAE feature space at layer 10.
  • Observe that:
    • The summed norms of the individual feature contributions exceed 2.7× the norm of v_boost.
    • Many large positive and negative feature activations cancel out.
    • Reconstruction error from SAE increases with more features → v_boost is partially off-manifold.
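
A sketch of the decomposition check, reusing the hypothetical `sae` and `v_boost` names from earlier sketches:

```python
import torch

with torch.no_grad():
    feats = torch.relu(sae.enc(v_boost))           # [6144] feature activations
    # Per-feature contribution vectors: f_i * (decoder column i)
    contribs = feats[:, None] * sae.dec.weight.T   # [6144, 768]
    sum_of_norms = contribs.norm(dim=-1).sum()     # large when contributions cancel
    print(sum_of_norms / v_boost.norm())           # > 2.7 reported above
    recon = contribs.sum(dim=0) + sae.dec.bias
    print((recon - v_boost).norm() / v_boost.norm())  # residual off-manifold part
```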

Combined with the WikiText PPL explosion, this suggests:

  • v_boost is not a “clean semantic direction”,
  • but rather a gradient-optimized artifact that exploits delicate interference among many features.

This reinforces the architectural decision: do not rely on raw v_boost as a universal fix; wrap it with detectors, gates, and domain guards.


4. What’s Included (vs What’s Elsewhere)

On Hugging Face, this card describes the project.

The actual experimental assets are in the GitHub repo:

🔗 GitHub - Daniele-Cangi/Neuro-Trace: Neural Network Interpretability Framework

There you will find:

  • phase_utils.py – shared utilities (dataset generation, model setup, IOI evaluation, hooks).
  • phase*_*.py – scripts for each phase:
    • SAE training & analysis
    • adversarial delta learning
    • sparse virus experiments
    • task boost training and scaling
    • war surface computation
    • defence & gating variants
    • detectors, domain guard, integrated defence
    • collateral damage / WikiText evaluation
    • alien generalization tests
  • checkpoints/ – locally saved:
    • SAE weights for each layer
    • adversarial deltas per layer
    • learned boost vectors (unconstrained, R=25)
    • virus / needs-boost / context detectors

Note: GPT-2 weights themselves are not in the repo; they are loaded from Hugging Face at runtime.


5. How to Reproduce (Conceptual)

Rough pipeline to follow in the repo:

  1. Setup

    • Install dependencies (PyTorch, TransformerLens, datasets, etc.).
    • Verify you can load GPT-2 Small and run the IOI baseline.
  2. Train SAE Atlas (optional but recommended)

    • Run the Phase 1–3 scripts to train SAEs on each layer.
    • Optionally, run circuit discovery scripts to inspect IOI-related features.
  3. Train Virus (Adversarial Delta)

    • Run Phase 4B / 5A scripts to:
      • find borderline IOI examples,
      • optimize δ to invert logit diffs at a target layer,
      • sweep across layers to get a vulnerability profile.
  4. Train Boost

    • Run Phase 7 scripts to:
      • learn unconstrained v_boost,
      • then constrained v_boost (R=25),
      • evaluate baseline vs boost vs attack vs attack+boost.
  5. Defence & Detectors

    • Phase 8B: compute war surface (α, β grid).
    • Phase 8A: initial gating based on logit_diff.
    • Phase 9A/B/C/D: train virus detector and needs-boost detector, evaluate gated defence v3.
    • Phase 13: train context classifier (IOI vs WikiText).
    • Phase 14: integrated defence evaluation (IOI + WikiText).
  6. Collateral & Generalization

    • Phase 11: measure WikiText PPL with/without defence.
    • Phase 10: alien IOI prompts (clause breaks, passive voice, counterfactual instructions) to test semantic vs template-specific behaviour.

6. Key Takeaways (Research Level)

NeuroTrace supports several higher-level hypotheses:

  1. SAE != Steering
    Sparse, monosemantic features are ideal for measurement, not necessarily for control. IOI behaviour is governed by dense residual directions that do not reduce to a small set of SAE features.

  2. Residual Controls are Powerful but Dangerous
    Learned steering vectors can:

    • completely disable IOI reasoning,
    • completely restore it even under attack,
      but pushing the model into off-manifold regions creates heavy collateral damage on unrelated text.
  3. Safety is an Architectural Problem
    A robust defence is not a single vector, but a system:

    • detectors to recognize failure modes,
    • domain guards to avoid touching out-of-distribution content,
    • gated application that targets only those prompts that actually need help.
  4. Towards “Neural Immune Systems”
    Wrapping a frozen LM with:

    • context classification,
    • failure detectors,
    • steering modules,
      points toward a family of neural immune systems that can be layered on top of existing models without re-training the base weights.

7. Intended Audience

This project is most relevant for:

  • Mechanistic interpretability researchers exploring residual streams, SAEs, and circuit-level analysis.
  • Adversarial robustness / safety researchers interested in:
    • fine-grained steering of LMs,
    • defences that operate at activation level,
    • trade-offs between robustness and collateral damage.
  • Research engineers who want a concrete, reproducible framework that goes from:
    • feature discovery → attack → defence → integrated system.

8. Citation / Reference

If you use ideas or code from NeuroTrace in your work, please reference the GitHub repo:

Daniele Cangi, NeuroTrace: Residual Attack & Defence on GPT-2 Small (IOI Task), 2025.
GitHub: GitHub - Daniele-Cangi/Neuro-Trace: Neural Network Interpretability Framework


9. Limitations & Warnings

  • All experiments are on GPT-2 Small and a single synthetic task (IOI).
    Behaviour might differ on modern, larger LMs and real-world data.
  • Steering vectors and detectors are trained on specific distributions; their generalization to unseen attacks or domains is not guaranteed.
  • This is research code, not a production security system.

Use it as a lab, not as a safety guarantee.