How can I drop duplicates with the datasets module?

I have a very large dataset (about 200 GB) and I need to make sure it contains no duplicates, dropping any that exist. How can I achieve that with datasets? Thanks!

One possible solution, which doesn’t exactly answer the question, is to:

  1. Convert the Dataset object to a Pandas DataFrame object (e.g. your_dataset.to_pandas()).
  2. Apply the pandas.DataFrame.drop_duplicates function.
  3. Finally, convert back to a Dataset object using the datasets.Dataset.from_pandas function.

I had the same question as you and I couldn’t find anything better.

Hi,

Papers such as OPT and PaLM describe deduplicating their datasets with MinHash and locality-sensitive hashing (LSH), using a Jaccard similarity threshold of around 0.9. You would have to implement these algorithms yourself, or you can use Google’s tool for deduplicating text: google-research/deduplicate-text-datasets on GitHub.
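
To make the idea concrete, here is a rough, standard-library-only sketch of MinHash with LSH banding. It is not the code from any of the papers or repositories mentioned here. The parameters (128 hash functions, 32 bands, word 3-gram shingles) and every function name are my own assumptions, and candidates are confirmed with an exact Jaccard check on the shingle sets rather than an estimate from the signatures.

```python
import hashlib

NUM_PERM = 128   # hash functions per signature (assumed value)
BANDS = 32       # LSH bands; rows per band = NUM_PERM // BANDS
THRESHOLD = 0.9  # Jaccard similarity above which a document counts as a duplicate

def shingles(text, n=3):
    """Set of word n-grams; short texts become a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash(items):
    """MinHash signature: the minimum salted hash over all items, per hash function."""
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature

def jaccard(a, b):
    return len(a & b) / len(a | b)

def deduplicate(texts):
    """Keep the first document of every near-duplicate group."""
    rows = NUM_PERM // BANDS
    buckets = {}        # (band index, band values) -> indices of kept documents
    kept_shingles = {}  # index of kept document -> its shingle set
    for i, text in enumerate(texts):
        sh = shingles(text)
        sig = minhash(sh)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(BANDS)]
        # Documents sharing at least one band are candidate duplicates.
        candidates = {j for k in keys for j in buckets.get(k, ())}
        if any(jaccard(sh, kept_shingles[j]) >= THRESHOLD for j in candidates):
            continue  # near-duplicate of a document we already kept
        kept_shingles[i] = sh
        for k in keys:
            buckets.setdefault(k, []).append(i)
    return [texts[i] for i in kept_shingles]

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about deduplicating data",
]
unique_docs = deduplicate(docs)
```

More bands with fewer rows each catch more near-duplicates at the cost of more candidate comparisons. For real workloads a dedicated library or the scripts linked in this thread will be much faster than this pure-Python loop.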

Hugging Face also has an example of text deduplication in the huggingface/transformers repository on GitHub (main branch): transformers/preprocessing.py and transformers/minhash_deduplication.py.

Best,

Enrico


Hi,

I made some adaptations to the deduplication scripts in the CodeParrot research repository. Here is what a bare-bones deduplication script with MinHash and LSH looks like: conceptofmind/Huggingface-deduplicate on GitHub.

I also opened a new post covering a more general use case: Minhash Deduplication

Best,

Enrico
