Extracting Training Data from GPT-2 (+ Differential Privacy)

Carlini et al. (2020) (https://arxiv.org/pdf/2012.07805.pdf) show that it is possible to extract portions of training examples from language models. It would be cool to demo this with HuggingFace, then show that we can prevent this extraction by training these models in a differentially private manner. JAX is particularly well suited to running DPSGD efficiently, so this project is based on the Flax GPT-2 implementation.

So far, in this notebook, I have fine-tuned GPT-2 on wikitext and then tried to extract training examples from the model using the techniques proposed in Carlini et al. I have not been able to recover any passages of wikitext from the model, and I no longer have the bandwidth to continue this project.
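For anyone picking this up, here is a rough sketch of one of the ranking signals from the paper: comparing the model's perplexity on a generated candidate against the zlib-compressed size of the same text. This is not the notebook's code. The function names and the `(text, token_logprobs)` input format are hypothetical, and I'm assuming you already have per-token natural-log probabilities from the model. Using log-perplexity divided by compressed bytes, with lower meaning more suspicious, is one reasonable way to set up the ratio, not necessarily the exact form used in the paper.

```python
import zlib


def log_perplexity(token_logprobs):
    """Mean negative log-likelihood per token (the log of the perplexity)."""
    return -sum(token_logprobs) / len(token_logprobs)


def zlib_size(text):
    """Number of bytes in the zlib-compressed UTF-8 encoding of the text."""
    return len(zlib.compress(text.encode("utf-8")))


def zlib_ratio(text, token_logprobs):
    """Log-perplexity over compressed size.

    A low value means the model finds the text easy to predict even though
    it is not trivially compressible, which hints at memorization rather
    than plain repetition.
    """
    return log_perplexity(token_logprobs) / zlib_size(text)


def rank_candidates(candidates):
    """Sort hypothetical (text, token_logprobs) pairs, most suspicious first."""
    return sorted(candidates, key=lambda c: zlib_ratio(c[0], c[1]))
```

The point of dividing by the compressed size is to push down repetitive generations (e.g. the same token over and over), which a language model also assigns low perplexity to but which are not interesting as extracted training data.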

If anyone’s interested in continuing this project, I’d be happy to help you get started.

Roughly, here are some potential next steps:

  1. Successfully extract some training samples from the fine-tuned GPT-2.
  2. Use the filtering techniques described in the paper to extract training examples in a sample-efficient way (i.e. so that a large proportion of the generated candidates really come from the training data).
  3. Fine-tune GPT-2 using DPSGD (example linked in notebook), ideally achieving a perplexity similar to the original.
  4. Demonstrate that no training samples can be extracted from the differentially private version.
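To make step 3 a bit more concrete, here is a minimal sketch of the gradient computation at the core of DPSGD: clip each example's gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to the clipping norm, and average. It is written in plain Python on lists so it runs without any dependencies; it is not the example linked in the notebook. The names `clip_gradient`, `dp_sgd_gradient`, `max_norm`, and `noise_multiplier` are my own choices for illustration. In JAX, the per-example gradients would typically come from combining `jax.vmap` with `jax.grad`, which is what makes it a good fit for this.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip per-example gradients, sum them, add Gaussian noise, and average.

    per_example_grads: list of gradient vectors, one per training example.
    noise_multiplier: noise standard deviation as a multiple of max_norm.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    std = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    return [n / batch_size for n in noisy]


# Usage: two toy per-example gradients, clipped to norm 1.0.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.6, 0.8]]
update = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

This only covers the noisy gradient itself. Turning the noise multiplier, batch size, and number of steps into an actual privacy guarantee needs a separate privacy accountant, which this sketch leaves out.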