Most clinical de-identification benchmarks evaluate per-record masking accuracy. This dataset does something different: it annotates why a re-identification risk threshold was crossed, not just that it was.
The key field is trigger_reason, which distinguishes between two causes:
- cross_modal_linkage: risk crossed the threshold because text + ASR (or text + imaging metadata) appeared together for the same patient. Neither modality alone would have triggered it.
- exposure_accumulation: risk crossed the threshold through baseline event accumulation, with no cross-modal amplification involved.
That counterfactual ground truth does not exist in i2b2, n2c2, or MIDI-B. Those benchmarks tell you whether PHI was masked correctly. This one tells you what caused the risk crossing in the first place.
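To make the distinction concrete, here is a minimal sketch of how such a counterfactual label could be derived. It is not the dataset's actual scoring code: the function name `label_trigger_reason`, the threshold value, the per-modality risk scores, and the rule "if any single modality crosses on its own, call it accumulation" are all assumptions made for illustration. Only the two label strings come from the dataset description.

```python
from typing import Dict, Optional

# Hypothetical threshold; the benchmark's real threshold may differ.
RISK_THRESHOLD = 0.5


def label_trigger_reason(
    risk_by_modality: Dict[str, float],
    combined_risk: float,
    threshold: float = RISK_THRESHOLD,
) -> Optional[str]:
    """Return a trigger_reason label, or None if the threshold was not crossed.

    risk_by_modality holds the (assumed) risk each modality would carry
    if it were the only one observed for the patient.
    """
    if combined_risk < threshold:
        return None  # no crossing, so nothing to explain

    # Counterfactual check: would any single modality have crossed alone?
    crossed_alone = any(r >= threshold for r in risk_by_modality.values())
    if crossed_alone:
        # Simplifying assumption: a single-modality crossing is treated
        # as baseline accumulation with no cross-modal amplification.
        return "exposure_accumulation"

    # The crossing only happens when the modalities appear together.
    return "cross_modal_linkage"


# Illustrative inputs only; these numbers are made up.
print(label_trigger_reason({"text": 0.3, "asr": 0.3}, combined_risk=0.7))
print(label_trigger_reason({"text": 0.6, "asr": 0.1}, combined_risk=0.65))
```

The point of the sketch is that the label depends on what would have happened without the other modality, which per-record masking accuracy alone cannot express.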
The dataset also includes a live scorer you can try before downloading anything:
Dataset: vkatg/streaming-phi-deidentification-benchmark · Datasets at Hugging Face
Two models trained on it:
- PolicyNet (decides masking action from exposure state): vkatg/exposureguard-policynet · Hugging Face
- SynthRewrite-T5 (rewrites notes with versioned pseudonyms): vkatg/exposureguard-synthrewrite-t5 · Hugging Face
I am looking for feedback on the cross-modal linkage detection approach or the counterfactual annotation methodology, and I am happy to answer questions about either.