How can I drop duplicates on datasets module?

ju-resplande · March 4, 2022, 6:56pm

I have a very large dataset (about 200 GB) and I need to make sure it contains no duplicates, dropping any that are found. How can I achieve that with datasets? Thanks!

cspartalis · July 4, 2022, 2:13pm

One possible solution (which doesn’t exactly answer the question) is to:

1. Convert the Dataset object to a pandas DataFrame object (e.g. pandas.DataFrame(your_dataset)).
2. Apply the pandas.DataFrame.drop_duplicates function.
3. Finally, convert back to a Dataset object using the datasets.Dataset.from_pandas function.

I had the same question as you and couldn’t find anything better.

conceptofmind · July 4, 2022, 3:08pm

Hi,

Papers such as OPT and PaLM describe dataset deduplication using MinHash and locality-sensitive hashing (LSH), with a Jaccard similarity threshold of about 0.9. You would have to implement these algorithms yourself, or you can use Google’s tool for deduplicating text: GitHub - google-research/deduplicate-text-datasets.
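To make the idea concrete, here is a small pure-Python sketch of MinHash plus LSH banding. It is only an illustration under stated assumptions, not the method from those papers or from the Google tool: the signature length (128), the banding (32 bands of 4 rows), the 5-word shingles, and all function names are choices made for this example. Candidate pairs found through LSH are verified with the exact Jaccard similarity of their shingle sets.

```python
import hashlib

NUM_PERM = 128             # signature length (assumed for this sketch)
BANDS = 32                 # number of LSH bands (assumed)
ROWS = NUM_PERM // BANDS   # rows per band


def shingles(text, k=5):
    """Return the set of k-word shingles of a text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash; each seed stands in for one 'permutation'."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set):
    """MinHash signature: the minimum hash value per seed."""
    return tuple(min(_hash(seed, s) for s in shingle_set)
                 for seed in range(NUM_PERM))


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def deduplicate(texts, threshold=0.9):
    """Return indices of texts kept after near-duplicate removal."""
    buckets = {}        # (band index, band values) -> indices of kept texts
    kept_shingles = {}  # index of kept text -> its shingle set
    kept = []
    for idx, text in enumerate(texts):
        sh = shingles(text)
        sig = minhash(sh)
        keys = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        # Texts sharing at least one band with this one are candidates.
        candidates = {j for key in keys for j in buckets.get(key, ())}
        # Verify candidates with the exact Jaccard similarity.
        if any(jaccard(sh, kept_shingles[j]) >= threshold for j in candidates):
            continue  # near-duplicate of an already kept text
        kept.append(idx)
        kept_shingles[idx] = sh
        for key in keys:
            buckets.setdefault(key, []).append(idx)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "completely different sentence about deduplicating very large text corpora",
]
print(deduplicate(docs))  # [0, 2]
```

For a dataset in the hundreds of gigabytes you would not keep every shingle set in memory like this; the sketch is only meant to show how signatures, bands, and the similarity threshold fit together.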

Hugging Face also has an example of text deduplication in the transformers repository: transformers/preprocessing.py at main · huggingface/transformers · GitHub and transformers/minhash_deduplication.py at main · huggingface/transformers · GitHub.

Best,

Enrico

conceptofmind · July 5, 2022, 1:32am

Hi,

I made some adaptations to the deduplication scripts in the CodeParrot research repository. Here is what a barebones deduplication script with MinHash and LSH would look like: GitHub - conceptofmind/Huggingface-deduplicate

I also opened up a new post regarding a general use case here: Minhash Deduplication

Best,

Enrico
