Been building a stateful PHI de-identification system for streaming multimodal data. Here's what I learned

A name in a clinical note. The same name in an ASR transcript ten minutes later. A matching date in a waveform header. Three records, each one harmless on its own. Together they’re enough to re-identify a patient, and most masking systems never see it coming because they process each record in isolation.

That’s the problem I’ve been working on. The system tracks cumulative PHI exposure per patient across modalities and time, maintains a risk score, and escalates masking policy automatically as exposure accumulates. When a text record and an audio transcript cross a similarity threshold suggesting they describe the same patient, the risk score adjusts before the next record is processed.
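
For anyone who wants the shape of that loop in code, here is a minimal sketch, not the actual implementation in the repo. Everything in it is an assumption made for illustration: the tier names, the thresholds, the saturating risk formula, the link bonus, and the use of cosine similarity over per-record embeddings.

```python
import math
from dataclasses import dataclass, field

# All constants below are hypothetical, chosen only to make the sketch concrete.
POLICIES = ["raw", "pseudonymize", "redact", "synthetic_rewrite"]  # weakest to strongest
THRESHOLDS = [0.3, 0.6, 0.85]  # risk level at which each stronger tier kicks in
LINK_THRESHOLD = 0.8           # cosine similarity that counts as "same patient"
LINK_BONUS = 0.2               # risk added per linked pair of modalities


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class PatientState:
    exposure: float = 0.0                               # cumulative PHI units seen
    last_embedding: dict = field(default_factory=dict)  # modality -> latest embedding
    linked: set = field(default_factory=set)            # modality pairs already linked

    def observe(self, modality, phi_units, embedding):
        """Update state from one record, before the next record is processed."""
        self.exposure += phi_units
        for other, emb in self.last_embedding.items():
            pair = frozenset((modality, other))
            if other != modality and pair not in self.linked:
                if cosine(embedding, emb) >= LINK_THRESHOLD:
                    self.linked.add(pair)
        self.last_embedding[modality] = embedding

    @property
    def risk(self):
        """Saturating function of exposure, bumped by each cross-modal link."""
        base = 1.0 - math.exp(-0.5 * self.exposure)
        return min(1.0, base + LINK_BONUS * len(self.linked))

    def policy(self):
        """Pick the strongest tier whose threshold the current risk has reached."""
        tier = sum(self.risk >= t for t in THRESHOLDS)
        return POLICIES[tier]


# Usage: a text record, then an ASR transcript with a similar embedding.
state = PatientState()
state.observe("text", 1.0, [1.0, 0.0])
state.observe("asr", 1.0, [0.9, 0.1])
print(state.risk, state.policy())
```

The point of the sketch is the ordering: `observe` updates exposure and cross-modal links first, so the policy chosen for the next record already reflects the link.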

Getting the adaptive policy to outperform static baselines was straightforward on bursty workloads: privacy 0.991, utility 0.847, delta-AUROC -0.917. What surprised me was how much the gap narrows on monotonic risk streams, where simple pseudonymization does almost as well.

One thing that took longer than expected was getting the synthetic rewrite component to produce consistent placeholder formatting. Solved it, but it took two full training runs to track down the cause.

One thing I’m still thinking about is that narrowing gap on monotonic risk streams. Is it a fundamental limitation of threshold-based escalation, or a training data coverage issue? Curious if anyone has hit the same tradeoff.

Full system is open-source.

Spaces: https://huggingface.co/spaces/vkatg/amphi-rl-dpgraph

https://huggingface.co/spaces/vkatg/dcpg-scorer-demo

Models: https://huggingface.co/vkatg/exposureguard-synthrewrite-t5

https://huggingface.co/vkatg/exposureguard-policynet

https://huggingface.co/vkatg/exposureguard-fedcrdt-distill

https://huggingface.co/vkatg/exposureguard-dcpg-encoder

https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer

https://huggingface.co/vkatg/exposureguard-dagplanner

Datasets: https://huggingface.co/datasets/vkatg/multimodal-phi-masking-benchmark

https://huggingface.co/datasets/vkatg/streaming-phi-deidentification-benchmark

https://huggingface.co/datasets/vkatg/dag_remediation_traces

Code: https://github.com/azithteja91/phi-exposure-guard
