Hi,
I adapted the deduplication scripts from the CodeParrot research repository. Here is what a barebones deduplication script using MinHash and LSH looks like: GitHub - conceptofmind/Huggingface-deduplicate
I also opened a new post covering the general use case here: Minhash Deduplication
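
For anyone who wants a feel for the idea before opening the repository, here is a rough standard-library sketch of MinHash signatures with LSH banding. It is not the code from the linked repository or from CodeParrot. The function names (`shingles`, `minhash`, `deduplicate`) and the parameter values (128 hashes, 32 bands, 3-token shingles, 0.8 threshold) are my own assumptions for illustration.

```python
import hashlib
import re
from collections import defaultdict

NUM_PERM = 128  # signature length (assumed value, not taken from the repo)
BANDS = 32      # number of LSH bands; rows per band = NUM_PERM // BANDS
NGRAM = 3       # shingle size in tokens (assumed value)


def shingles(text, n=NGRAM):
    """Split text into a set of overlapping n-token shingles."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text, num_perm=NUM_PERM):
    """MinHash signature: the minimum hash over all shingles, per seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(num_perm))


def deduplicate(docs, threshold=0.8, num_perm=NUM_PERM, bands=BANDS):
    """Return indices of documents kept after near-duplicate removal.

    A document is dropped if it shares at least one LSH band with an
    already-kept document and their estimated Jaccard similarity
    (fraction of matching signature positions) reaches the threshold.
    """
    rows = num_perm // bands
    buckets = defaultdict(list)  # (band index, band values) -> kept doc ids
    signatures = {}
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash(doc, num_perm)
        band_keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]
        # Only documents sharing a bucket are compared in full.
        candidates = set()
        for key in band_keys:
            candidates.update(buckets.get(key, ()))
        is_duplicate = any(
            sum(x == y for x, y in zip(sig, signatures[c])) / num_perm >= threshold
            for c in candidates
        )
        if not is_duplicate:
            kept.append(idx)
            signatures[idx] = sig
            for key in band_keys:
                buckets[key].append(idx)
    return kept
```

Calling `deduplicate(list_of_strings)` returns the positions of the documents to keep, with the first occurrence of each near-duplicate group winning. For real datasets a dedicated MinHash library and batched processing will be much faster than this pure-Python loop.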
Best,
Enrico