Streaming PHI de-identification benchmark with cross-modal causality annotations

Most clinical de-identification benchmarks evaluate per-record masking accuracy. This dataset does something different: it annotates why a re-identification risk threshold was crossed, not just that it was.

The key field is trigger_reason, which distinguishes between two causes:

  • cross_modal_linkage: risk crossed the threshold because text + ASR (or text + imaging metadata) appeared together for the same patient. Neither modality alone would have triggered it.
  • exposure_accumulation: risk crossed the threshold through baseline event accumulation, with no cross-modal amplification involved.
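
To make the distinction concrete, here is a rough sketch of the counterfactual test behind the two labels. It is an illustration only, not the dataset's actual annotation pipeline: the function name, the per-modality risk scores, and the threshold value are all hypothetical, and it assumes you can score each modality in isolation as well as combined.

```python
from typing import Optional


def label_trigger_reason(
    risk_text_only: float,
    risk_other_only: float,
    risk_combined: float,
    threshold: float,
) -> Optional[str]:
    """Hypothetical labeling rule; all inputs are assumed risk scores.

    risk_text_only  : risk if only the text modality had been seen
    risk_other_only : risk if only the second modality (ASR or imaging
                      metadata) had been seen
    risk_combined   : risk with both modalities present
    """
    # No threshold crossing means there is nothing to explain.
    if risk_combined < threshold:
        return None

    # Counterfactual check: would either modality alone have crossed?
    if max(risk_text_only, risk_other_only) >= threshold:
        # A single modality was enough, so linkage was not the cause.
        return "exposure_accumulation"

    # Only the combination crosses the threshold.
    return "cross_modal_linkage"


# Example values are made up for illustration.
print(label_trigger_reason(0.3, 0.4, 0.8, threshold=0.7))
print(label_trigger_reason(0.75, 0.1, 0.8, threshold=0.7))
```

With these made-up numbers, the first call is a crossing that neither modality reaches alone, and the second is one where the text stream alone is already over the threshold.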

That counterfactual ground truth does not exist in i2b2, n2c2, or MIDI-B. Those benchmarks tell you whether PHI was masked correctly. This one tells you what caused the risk crossing in the first place.

The dataset also includes a live scorer you can try before downloading anything:

Dataset: vkatg/streaming-phi-deidentification-benchmark · Datasets at Hugging Face

Two models trained on it:

I am looking for feedback on the cross-modal linkage detection approach or the counterfactual annotation methodology, and I am happy to answer questions about either.
