How is duplicate data in dataset splits/subsets handled in the hub

patricklifixie · August 13, 2024, 6:45pm

How does the Hugging Face Hub deal with duplicates in data subsets and splits? I.e., if I have a dataset with subset_1, containing split_1 and split_2, and there is overlapping data between split_1 and split_2, does the Hub store only one instance of the data and point both splits at it? Or is there no deduplication at all?

And how does this work with duplicates across subsets as well?

For some background, I'm trying to upload various splits of a rather large dataset, and the difference between each split is going to be the addition of several processed columns. I wonder if I need to manage the duplication of the data myself or if the Hugging Face Hub already does it for me.

severo · August 17, 2024, 2:22pm

For now, there is no automatic deduplication. cc @julien-c
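Since nothing is deduplicated automatically, one way to manage it yourself is to store the shared base columns once and upload only the new columns for each processed variant, then rejoin them after loading. Below is a minimal sketch in plain Python, assuming every row carries a stable key. The `id` column, the sample rows, and the helper names `split_off_new_columns` and `rejoin` are all hypothetical and not part of any Hugging Face API.

```python
# Hypothetical base data; column names are illustrative only.
base_rows = [
    {"id": 0, "text": "hello"},
    {"id": 1, "text": "world"},
]

# A processed variant: the same rows plus one extra column.
processed = [
    {"id": 0, "text": "hello", "n_chars": 5},
    {"id": 1, "text": "world", "n_chars": 5},
]


def split_off_new_columns(processed_rows, base_columns, key="id"):
    """Keep the key plus any column that is not already in the base data."""
    return [
        {k: v for k, v in row.items() if k == key or k not in base_columns}
        for row in processed_rows
    ]


def rejoin(base_rows, extra_rows, key="id"):
    """Merge the extra columns back onto the base rows by key."""
    by_key = {row[key]: row for row in extra_rows}
    return [{**row, **by_key.get(row[key], {})} for row in base_rows]


# Only `extras` would need to be stored alongside the base data.
extras = split_off_new_columns(processed, base_columns={"id", "text"})
restored = rejoin(base_rows, extras)
```

With this layout the large shared columns are uploaded once, and each variant costs only the size of its added columns plus the key.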

More info: Julien Chaumond on LinkedIn: I am super excited to announce that we've acquired XetHub! 🎉 XetHub has…

Under the hood they’ve been adding file chunking and deduplication inside Git.
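To illustrate the general idea of chunking plus deduplication, here is a toy sketch of a content-addressed chunk store. It is not XetHub's implementation: the `ChunkStore` class, the fixed 4-byte `CHUNK_SIZE`, and the sample byte strings are assumptions made for illustration, and real systems typically use much larger, content-defined chunks.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, so the example is easy to follow


class ChunkStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self):
        self.chunks = {}  # SHA-256 hex digest -> chunk bytes

    def put(self, data):
        """Split data into fixed-size chunks and return the list of digests."""
        manifest = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # An already-known chunk is not stored again.
            self.chunks.setdefault(digest, chunk)
            manifest.append(digest)
        return manifest

    def get(self, manifest):
        """Rebuild the original bytes from a list of digests."""
        return b"".join(self.chunks[digest] for digest in manifest)


store = ChunkStore()
split_1 = store.put(b"AAAABBBBCCCC")
split_2 = store.put(b"AAAABBBBDDDD")  # shares its first two chunks with split_1

unique_chunks = len(store.chunks)          # chunks actually stored
referenced = len(split_1) + len(split_2)   # chunks referenced by both files
```

Each file is represented by its own list of digests, so two overlapping files share storage for the chunks they have in common while still being reconstructed independently.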
