How do you handle sensitive data (PII) in datasets before training models?

Hi everyone 👋

I’ve been working on a few ML pipelines recently (datasets + inference APIs), and I ran into something I don’t see discussed much:

Sensitive data leaking into datasets and logs

In a couple of cases, I noticed things like:

  • Emails and phone numbers inside training data

  • User inputs getting stored in logs during inference

  • Internal IDs or tokens accidentally exposed

Curious how others are handling this:

  • Do you clean/anonymize datasets before training?

  • Are you masking data in logs and API responses?

  • Or is this something you only worry about for compliance later on?
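
To make the first question concrete, here is roughly the kind of pre-training cleanup I have in mind. It is only a minimal sketch: it assumes simple regexes are good enough for emails and phone numbers (they will miss names, addresses, and unusual formats), and `redact`, `redact_records`, and the `[EMAIL]` / `[PHONE]` placeholders are names I made up for illustration.

```python
import re

# Deliberately simple patterns, chosen for illustration only.
# They are not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least 7 digits/separators, then a digit (optionally prefixed by "+").
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact(text: str) -> str:
    """Replace emails and phone-like strings with fixed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def redact_records(records, fields):
    """Return copies of the records with the given string fields redacted.

    `records` is a list of dicts; `fields` names the keys to scan.
    Non-string or missing fields are left untouched.
    """
    cleaned = []
    for rec in records:
        out = dict(rec)  # shallow copy so the input is not mutated
        for field in fields:
            if isinstance(out.get(field), str):
                out[field] = redact(out[field])
        cleaned.append(out)
    return cleaned
```

Running something like `redact_records(rows, ["text"])` over a dataset before it goes into training or embeddings would catch the obvious cases. Anything less regular probably needs a dedicated PII detection tool on top.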

What surprised me:

Even small datasets can contain a lot of hidden sensitive info, and once it’s used in:

  • training

  • embeddings

  • logging systems

…it becomes much harder to clean up afterward.
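
Since cleanup after the fact is the hard part, one option on the logging side is to redact before anything is written. The sketch below uses Python's standard `logging.Filter`. The `RedactingFilter` name and the `sk_` / `tok_` token format are assumptions for illustration, not something from a real system.

```python
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical token format (e.g. "tok_abcdef123456"); adjust to your own secrets.
TOKEN_RE = re.compile(r"\b(?:sk|tok)_[A-Za-z0-9]{8,}\b")


class RedactingFilter(logging.Filter):
    """Mask emails and token-like strings in a log record before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # getMessage() merges record.args into record.msg, so values passed
        # as logging arguments are redacted too.
        message = record.getMessage()
        message = EMAIL_RE.sub("[EMAIL]", message)
        message = TOKEN_RE.sub("[TOKEN]", message)
        record.msg = message
        record.args = ()  # already merged; avoid formatting twice
        return True  # keep the record, just with masked content
```

Attaching it with `handler.addFilter(RedactingFilter())` applies the masking to everything that handler writes, which seems safer than relying on every call site to remember it.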

Would love to hear:

  • Your current approach

  • Any tools or workflows you use

  • Best practices you follow

Just feed it to the model, then do some RL to “mask” the data.
And huge companies like Anthropic don’t care about this at all.

I think it’s a waste of time. I developed a new architecture that fully externalizes all AI memory function. Models are nothing but interchangeable compute power to me. I simply drag and drop whatever I want into it, hit Digest to memory, and it’s done. The model can pull up any part of it anytime you want. Retraining models for anything wastes both time and money on an end result that isn’t perfect and/or deterministic.