How is duplicate data in dataset splits/subsets handled on the Hub?

How does the Hugging Face Hub deal with duplicates in data subsets and splits? I.e., if I have a dataset with subset_1 containing split_1 and split_2, and there is overlapping data between split_1 and split_2, does the Hub store only one instance of the data and point both splits at it? Or is there no deduplication at all?

And how does this work with duplicates across subsets as well?

For some background, I’m trying to upload various splits of a rather large dataset, and the only difference between the splits is going to be the addition of several processed columns. I wonder if I need to manage the duplication of the data myself or if the Hugging Face Hub already does it for me.

For now, there is no automatic deduplication. cc @julien-c

More info: Julien Chaumond on LinkedIn: "I am super excited to announce that we've acquired XetHub! 🎉 XetHub has…"

Under the hood, they’ve been adding file chunking and deduplication inside Git.
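Since there is no automatic deduplication for now, one way to manage it yourself is to store the shared columns once and keep only the extra processed columns per split. Below is a minimal sketch of that idea in plain Python. It does not use any Hub API; the helper names (`row_fingerprint`, `split_base_and_extras`), the column names, and the toy rows are all hypothetical, and it assumes rows are JSON-serializable dicts.

```python
import hashlib
import json


def row_fingerprint(row, columns):
    """Hash the chosen columns of a row so identical rows share one key."""
    payload = json.dumps({c: row[c] for c in columns}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_base_and_extras(splits, base_columns):
    """Keep each distinct base row once; keep per-split extra columns by key."""
    base = {}    # fingerprint -> shared columns, stored a single time
    extras = {}  # split name -> {fingerprint -> columns unique to that split}
    for split_name, rows in splits.items():
        extras[split_name] = {}
        for row in rows:
            key = row_fingerprint(row, base_columns)
            base.setdefault(key, {c: row[c] for c in base_columns})
            extras[split_name][key] = {
                c: v for c, v in row.items() if c not in base_columns
            }
    return base, extras


# Hypothetical toy data: split_2 repeats one row of split_1 and adds a column.
splits = {
    "split_1": [{"text": "a", "label": 0}, {"text": "b", "label": 1}],
    "split_2": [
        {"text": "a", "label": 0, "length": 1},
        {"text": "c", "label": 1, "length": 1},
    ],
}
base, extras = split_base_and_extras(splits, base_columns=["text", "label"])
```

With this layout you would upload the base rows once and the much smaller per-split extras separately, then join them on the fingerprint at load time. Whether that is worth the extra bookkeeping depends on how large the shared columns are compared with the processed ones.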