Hi everyone,
After experimenting with PII anonymization pipelines, we started building a more structured approach to using LLMs with sensitive data.
A few things that surprised us:
- Naive regex + NER breaks quickly at scale
- Context loss can hurt model outputs more than expected
- Re-identification pipelines get tricky in multi-step workflows
We ended up moving toward a design where:
- sensitive data is abstracted before inference
- mappings are handled separately
- models never see raw PII
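To make that design concrete, here is a rough sketch of the shape, not our actual implementation. Everything in it is an assumption for illustration: the names `abstract`, `reidentify`, and `run_with_abstraction`, the `<LABEL_n>` placeholder format, and the two regex detectors, which are only stand-ins for whatever detection layer you use (as noted above, naive regex does not hold up at scale).

```python
import re
from typing import Callable, Dict

# Hypothetical stand-in detectors; swap in a real detection layer in practice.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def abstract(text: str, mapping: Dict[str, str]) -> str:
    """Replace detected values with placeholders and record them in `mapping`.

    `mapping` (placeholder -> raw value) lives outside the prompt, so it can be
    stored and access-controlled separately from anything sent to the model.
    """
    reverse = {value: token for token, value in mapping.items()}
    for label, pattern in PATTERNS.items():
        def substitute(match: "re.Match[str]", label: str = label) -> str:
            value = match.group(0)
            if value not in reverse:
                # Number placeholders per label: <EMAIL_1>, <EMAIL_2>, ...
                count = sum(1 for t in mapping if t.startswith(f"<{label}_")) + 1
                token = f"<{label}_{count}>"
                mapping[token] = value
                reverse[value] = token
            return reverse[value]

        text = pattern.sub(substitute, text)
    return text


def reidentify(text: str, mapping: Dict[str, str]) -> str:
    """Put the raw values back into model output."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text


def run_with_abstraction(
    text: str, model: Callable[[str], str], mapping: Dict[str, str]
) -> str:
    """The model callable only ever receives the abstracted prompt."""
    prompt = abstract(text, mapping)
    return reidentify(model(prompt), mapping)


if __name__ == "__main__":
    mapping: Dict[str, str] = {}
    print(abstract("Contact alice@example.com or +1 415 555 0100.", mapping))
    # Contact <EMAIL_1> or <PHONE_1>.
```

Passing the same `mapping` through every step keeps a given value on the same placeholder across a multi-step workflow, which is one way to keep re-identification manageable. It does nothing for the context-loss problem, since the model still only sees an opaque token.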
Curious how others are approaching this, especially in production settings.