Minhash Deduplication

Hi @liyongsea ,

In addition to the warning and error above, I have also been experiencing a second error that sometimes occurs on datasets around 300 GB in size:

```
Time to filter dataset: 866.99| 5244089/5244135 [1:22:04<00:00, 1083.08ex/s]
Size of filtered dataset: 52435971| 5237337/5244135 [1:21:50<00:04, 1364.48ex/s]
2889863it [33:35, 1635.90it/s]
Process ForkPoolWorker-1:
Killed
Traceback (most recent call last):
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
```

I believe this error may be related to this function:

```python
def minhash_iter(dataset_iterator: Type[Dataset]):
    with mp.Pool() as pool:
        for data in pool.imap_unordered(
            _compute_min_hash,
            ThreadedIterator(dataset_iterator, max_queue_size=10000),
            chunksize=100,
        ):
            if data is not None:
                yield data
```
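For what it's worth, the `Killed` line in the log suggests a worker process was terminated (e.g. by the OOM killer), after which the parent hits `BrokenPipeError` when writing to the dead worker's pipe. A minimal sketch of one possible mitigation, assuming memory growth in long-lived workers is the cause: recycle workers with `maxtasksperchild` so no single worker accumulates state for the whole run. Here `_compute_min_hash` is a hypothetical stand-in for the real hashing function, and `ThreadedIterator` is omitted for simplicity.

```python
import multiprocessing as mp


def _compute_min_hash(item):
    # Hypothetical stand-in for the real MinHash computation:
    # takes an (index, text) pair and returns a cheap fingerprint.
    idx, text = item
    return idx, hash(text) & 0xFFFFFFFF


def minhash_iter(dataset_iterator, processes=None, chunksize=100):
    # maxtasksperchild recycles each worker after a fixed number of
    # tasks, bounding the memory any single worker can accumulate.
    # This does not fix an undersized machine, but it can avoid the
    # slow memory growth that ends in an OOM kill + broken pipe.
    with mp.Pool(processes=processes, maxtasksperchild=1000) as pool:
        for data in pool.imap_unordered(
            _compute_min_hash, dataset_iterator, chunksize=chunksize
        ):
            if data is not None:
                yield data


if __name__ == "__main__":
    items = [(i, f"doc-{i}") for i in range(10)]
    results = list(minhash_iter(iter(items), processes=2))
    assert len(results) == 10
```

Reducing `max_queue_size` on the `ThreadedIterator`, or capping `processes` below the CPU count, are other ways to lower peak memory if the OOM theory holds.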

Thank you,

Enrico