I have a very large dataset (about 200 GB) and I need to make sure it contains no duplicates, dropping any that exist. How can I achieve that with datasets? Thanks!
One possible solution, which doesn’t exactly answer the question, is to:
- Convert the Dataset object to a Pandas DataFrame object (e.g.
- Use the pandas.DataFrame.drop_duplicates function.
- Finally, convert back to a Dataset object using the datasets.Dataset.from_pandas function.
I had the same question as you and I couldn’t find anything better.
Papers such as OPT and PaLM describe dataset deduplication with MinHash and locality-sensitive hashing (LSH), using a Jaccard similarity threshold of about 0.9. You would have to implement these algorithms yourself, or you can use Google’s tool for deduplicating text: GitHub - google-research/deduplicate-text-datasets.
Hugging Face also has an example of text deduplication in the transformers repository: transformers/preprocessing.py at main · huggingface/transformers · GitHub and transformers/minhash_deduplication.py at main · huggingface/transformers · GitHub.
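To make the MinHash + LSH idea mentioned above a bit more concrete, here is a small standard-library-only sketch. It is not the code from any of the linked repositories. The word 3-gram shingles, 128 hash functions, and 32 bands of 4 rows are illustrative assumptions, and all function names are made up. Candidates found through LSH buckets are confirmed with an exact Jaccard check against the 0.9 threshold, which a real 200 GB pipeline would probably replace with a signature-based estimate to save memory.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 128            # number of hash functions (assumed value)
BANDS = 32                # LSH bands (assumed value)
ROWS = NUM_PERM // BANDS  # rows per band


def shingles(text, n=3):
    """Set of word n-grams; short texts become a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(item, seed):
    # Salted 64-bit BLAKE2b hash stands in for one random permutation.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items):
    """MinHash signature: the minimum hash value under each hash function."""
    return [min(_hash(s, seed) for s in items) for seed in range(NUM_PERM)]


def jaccard(a, b):
    return len(a & b) / len(a | b)


def near_duplicate_indices(texts, threshold=0.9):
    """Return indices of texts that near-duplicate an earlier kept text."""
    buckets = defaultdict(list)  # (band, band values) -> kept indices
    kept = {}                    # kept index -> shingle set
    drop = []
    for idx, text in enumerate(texts):
        sh = shingles(text)
        sig = minhash_signature(sh)
        keys = [(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]
        # Any kept document sharing at least one band is a candidate.
        candidates = {j for k in keys for j in buckets.get(k, [])}
        if any(jaccard(sh, kept[j]) >= threshold for j in candidates):
            drop.append(idx)
            continue
        kept[idx] = sh
        for k in keys:
            buckets[k].append(idx)
    return drop


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "a completely different sentence about datasets",
        "the quick brown fox jumps over the lazy dog",
    ]
    print(near_duplicate_indices(docs))
```

The returned indices could then be passed to something like a filter step to drop those rows from the dataset.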
I made some adaptations to the deduplication scripts in the Code Parrot research repository. Here is what a barebones deduplication script would look like with MinHash and LSH: GitHub - conceptofmind/Huggingface-deduplicate
I also opened up a new post regarding a general use case here: Minhash Deduplication