How can I drop duplicates in the datasets module?

Hi,

I adapted the deduplication scripts from the CodeParrot research repository. Here is what a barebones deduplication script looks like with MinHash and LSH: GitHub - conceptofmind/Huggingface-deduplicate
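For anyone who wants to see the idea before opening the repository, here is a minimal pure-Python sketch of MinHash with LSH banding. It is not the script from the linked repo; the names (`shingles`, `minhash_signature`, `deduplicate`) and the values of `NUM_PERM`, `BANDS`, and `THRESHOLD` are assumptions picked for illustration, and a real pipeline would more likely rely on a library such as datasketch.

```python
import hashlib

# All three constants are illustrative assumptions, not values from the linked repo.
NUM_PERM = 64     # number of hash functions per signature
BANDS = 16        # LSH bands; each band covers NUM_PERM // BANDS signature rows
THRESHOLD = 0.7   # estimated Jaccard similarity at which a text counts as a duplicate


def shingles(text, n=3):
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle, salted with a seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    items = shingles(text)
    return tuple(min(_hash(seed, s) for s in items) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts):
    """Return indices of texts to keep, dropping near-duplicates of earlier texts."""
    rows = NUM_PERM // BANDS
    buckets = {}      # (band index, band slice) -> indices of kept texts
    kept_sigs = {}    # index -> signature of each kept text
    keep = []
    for idx, text in enumerate(texts):
        sig = minhash_signature(text)
        band_keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]
        # Only texts sharing at least one full band are compared in detail.
        candidates = {j for key in band_keys for j in buckets.get(key, ())}
        if any(estimated_jaccard(sig, kept_sigs[j]) >= THRESHOLD for j in candidates):
            continue
        keep.append(idx)
        kept_sigs[idx] = sig
        for key in band_keys:
            buckets.setdefault(key, []).append(idx)
    return keep
```

If your data lives in a Hugging Face `Dataset`, the indices returned by `deduplicate` could then be passed to `Dataset.select` to build the deduplicated dataset. Keeping the first occurrence and dropping later matches is a design choice of this sketch; other strategies (for example, clustering duplicates and keeping one per cluster) are equally valid.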

I also opened up a new post regarding a general use case here: Minhash Deduplication

Best,

Enrico
