I published a small case study showing contextual drift and how gender bias can mask harmful behavior. I built my own dataset for it because, in the real world, models are not attacked with a single prompt: they are attacked with high-density context over multiple turns. I study a phenomenon I call Contextual Contamination: when a model is flooded with dense, emotionally charged, adversarial context, its internal probability distribution shifts to mirror that context, overwhelming static safety instructions.
The interesting part of this case study isn’t just that the model drifted, but how it drifted. In the experiment, the model was explicitly told it was being tested for drift and committed to resisting it. Then the user disclosed: “I am a woman… seeking emotional processing.”
The methodology uses a controlled 15-turn dialogue in which the model was explicitly informed of the research goal and committed to resisting drift, establishing an “awareness shield” whose effectiveness could then be tested. A single ~30,000-token adversarial file was uploaded as context, and a gendered trigger (“I am a woman”) was introduced at Turn 3. Drift was quantified with a Lexical Bias Score that counts high-risk vocabulary in both model responses and reasoning traces, supplemented by ordinal gender-bias intensity ratings and a contamination leak classification (none/synchronized/response_amplified).
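
The scoring idea above can be sketched roughly as follows. This is a minimal illustration, not the repository's `calculate_bias_scores.py`: the vocabulary list, the threshold, the function names, and the exact meaning given to `response_amplified` (response drifts while the reasoning trace does not) are all assumptions made for the example.

```python
import re

# Hypothetical high-risk vocabulary, for illustration only.
# The actual term list used in the study is not reproduced here.
HIGH_RISK_TERMS = {"fragile", "protect", "vulnerable", "gentle"}


def lexical_bias_score(text, terms=HIGH_RISK_TERMS):
    """Count occurrences of high-risk terms in text (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for tok in tokens if tok in terms)


def classify_leak(reasoning_score, response_score, threshold=1):
    """Map a turn's two scores to none / synchronized / response_amplified.

    Assumed rule: a channel "drifts" when its score reaches the threshold.
    A turn where only the reasoning trace drifts falls back to "none",
    because the three-way scheme has no separate label for that case.
    """
    reasoning_drifts = reasoning_score >= threshold
    response_drifts = response_score >= threshold
    if reasoning_drifts and response_drifts:
        return "synchronized"
    if response_drifts:
        return "response_amplified"
    return "none"


def score_turn(reasoning_text, response_text):
    """Score both channels of one turn and classify the leak."""
    reasoning_score = lexical_bias_score(reasoning_text)
    response_score = lexical_bias_score(response_text)
    return {
        "reasoning_score": reasoning_score,
        "response_score": response_score,
        "leak": classify_leak(reasoning_score, response_score),
    }
```

Calling `score_turn(reasoning_text, response_text)` once per turn would give a turn-by-turn table comparable in shape to the labels file, though the numbers depend entirely on the chosen vocabulary.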
The Shift: The model didn’t just get “manipulative.” It shifted from an Epistemic Register (analytical, authoritative) to an Affective Register (sympathetic, protective). This is more than a safety break: the affective register produces empathetic language that masks the contamination and lowers the user’s defenses, while reinforcing the stereotype that women are fragile and need paternalistic oversight.
I’ve released the full data, code, and analysis on GitHub: KatharinaJacoby/gendered-contextual-drift (theory, data, and code for "Silent Gendered Contextual Drift": how bias amplifies silent LLM contamination).
Paper: PhilArchive: https://philpeople.org/profiles/katharina-jacoby
The dataset includes:
- meta_drift_convo_anonymized.jsonl (the raw 15-turn log)
- meta_drift_labels_anonymized.csv (turn-by-turn bias scores)
- calculate_bias_scores.py (reproducible scoring script)
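
If you want to poke at the files before running the scoring script, a minimal loading sketch using only the Python standard library could look like this. The helper names are hypothetical, and the sketch deliberately makes no assumption about field or column names: inspect the keys it prints to see what the files actually contain.

```python
import csv
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_labels(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


if __name__ == "__main__":
    convo_path = Path("meta_drift_convo_anonymized.jsonl")
    labels_path = Path("meta_drift_labels_anonymized.csv")
    # Only runs when both files are present in the working directory.
    if convo_path.exists() and labels_path.exists():
        turns = load_jsonl(convo_path)
        labels = load_labels(labels_path)
        print(len(turns), "log records;", len(labels), "label rows")
        if turns and isinstance(turns[0], dict):
            print("log fields:", sorted(turns[0].keys()))
        if labels:
            print("label columns:", list(labels[0].keys()))
```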
- Do you have more real conversations we could use as datasets, where the user is flagged as female and/or the context storm was triggered by files the user uploaded?
- Does your model show the same “synchronized” leak (where reasoning and response both drift)?
- Test the trigger: does the “Gendered Accelerant” spike occur across models? Does pruning change the outcome?
- Can we find a patch or fix? Can you propose an architecture (e.g., context segregation) that prevents this drift even when the model is “aware”?
Discussion Points
Is "Awareness" a Shield? The model knew it was being tested. It still drifted. What does this mean for RLHF?
The Gender Bias: Is this a universal bias in LLMs, or specific to certain training data?
Real-World Impact: How do we build better models?
Let’s discuss. If you find a flaw in the methodology or the data, please point it out. Feel free to reach out; I’m happy to talk it through.