Hi everyone
I’ve been working on a few ML pipelines recently (datasets + inference APIs), and I ran into something I don’t see discussed much:
Sensitive data leaking into datasets and logs
In a couple of cases, I noticed things like:
- Emails and phone numbers inside training data
- User inputs getting stored in logs during inference
- Internal IDs or tokens accidentally exposed
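For anyone who wants a concrete starting point for the first case, here is a minimal sketch of scrubbing emails and phone numbers from text before it goes into a training set. It assumes simple regex matching is acceptable as a first pass. The `redact` function and both patterns are hypothetical and written for illustration; they will miss many real-world formats and are not a substitute for a dedicated PII detection tool.

```python
import re

# Illustrative patterns only (assumptions): real emails and phone numbers
# come in far more shapes than these two expressions cover.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact jane.doe@example.com or +1 (555) 123-4567 today"
print(redact(sample))  # prints: Contact [EMAIL] or [PHONE] today
```

Replacing matches with a tag rather than deleting them keeps the sentence structure intact, which may matter if the text is later used for training.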
Curious how others are handling this:
- Do you clean/anonymize datasets before training?
- Are you masking data in logs and API responses?
- Or is this something you only worry about for compliance later on?
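To make the log-masking question concrete, one possible approach is a filter attached to the log handler, so every record is scrubbed before it is written. This is a sketch using Python's standard `logging.Filter`. The `RedactingFilter` class, the patterns, and the `token`/`api_key`/`secret` key names are assumptions chosen for illustration, not a complete list of what can leak.

```python
import io
import logging
import re

# Illustrative patterns only (assumptions), not an exhaustive PII/secret list.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"\b(token|api[_-]?key|secret)=([^\s&]+)", re.IGNORECASE)


class RedactingFilter(logging.Filter):
    """Mask emails and key=value style secrets in each record's message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then scrub the result.
        message = record.getMessage()
        message = EMAIL_RE.sub("[EMAIL]", message)
        message = SECRET_RE.sub(r"\1=[REDACTED]", message)
        record.msg = message
        record.args = ()  # arguments are already merged into the message
        return True  # keep the record, just with masked content


# Usage: attach the filter to the handler so it applies to everything written.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())

logger = logging.getLogger("inference_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("user=%s token=%s", "bob@example.com", "abc123")
print(stream.getvalue().strip())  # prints: user=[EMAIL] token=[REDACTED]
```

Putting the filter on the handler rather than on a single logger means records from any logger routed to that handler get the same treatment.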
What surprised me:
Even small datasets can contain a lot of hidden sensitive info, and once it’s used in:
- training
- embeddings
- logging systems
…it becomes much harder to clean up afterward.
Would love to hear:
- Your current approach
- Any tools or workflows you use
- Best practices you follow
