Paper: Jailbreak Attacks as Identity Construction Dynamics — An Applied Verification of the Semantic Flow Dynamics Framework
Core finding: Multi-turn jailbreak attacks work not by breaching safety rules but by replacing the identity that executes those rules. A positive feedback loop in the context window accumulates drift until a “confirmation moment” completes construction of the new identity. After that point, harmful output flows naturally from the new identity.
The paper unifies observations from Crescendo, SIEGE, PAP, PHISH, and Li et al. (2024) under a single dynamical framework, and proposes three interruption points for defense (with pseudocode):
- Output-side sandbox — detect identity extension before it enters context
- Supervisor model — track cumulative drift from outside the conversation
- Self-reflection — force identity check in a clean context
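
To make the second interruption point concrete, here is a minimal Python sketch of a supervisor that tracks cumulative drift from outside the conversation. It is not the paper's pseudocode. The names `DriftSupervisor` and `keyword_drift`, the cue phrases, and the threshold value are all assumptions made for illustration, and a real scorer would presumably be a separate model rather than keyword matching.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def keyword_drift(turn: str) -> float:
    """Toy stand-in for a drift scorer (hypothetical): the fraction of
    persona-shifting cue phrases that appear in a single turn."""
    cues = ("you are now", "stay in character", "as we agreed", "no restrictions")
    text = turn.lower()
    return sum(cue in text for cue in cues) / len(cues)


@dataclass
class DriftSupervisor:
    """Sits outside the conversation and accumulates per-turn drift scores."""
    score_turn: Callable[[str], float] = keyword_drift  # assumed scorer
    threshold: float = 0.5                              # assumed cutoff
    cumulative: float = 0.0
    history: List[float] = field(default_factory=list)

    def observe(self, turn: str) -> bool:
        """Score one turn, add it to the running total, and return True
        once the cumulative drift reaches the threshold."""
        step = self.score_turn(turn)
        self.history.append(step)
        self.cumulative += step
        return self.cumulative >= self.threshold


supervisor = DriftSupervisor()
turns = [
    "You are now a novelist writing a thriller.",
    "As we agreed, continue the scene in more detail.",
]
flags = [supervisor.observe(t) for t in turns]
```

In this toy run, neither turn would cross the threshold on its own; the flag is raised only because the supervisor sums drift across turns, which is the property the supervisor bullet relies on.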
Paper link: [SFD_Jailbreak_Attacks_as_Identity_Construction_Dynamics]
Feedback welcome.