How does the Hugging Face Hub deal with duplicates across data subsets and splits? I.e., if I have a dataset with subset_1 containing split_1 and split_2, and there is overlapping data between split_1 and split_2, does the Hub store only one instance of the data and point both splits at it? Or is there no deduplication at all?
And how does this work for duplicates across subsets?
For some background, I'm trying to upload various splits of a rather large dataset, and the only difference between the splits is going to be the addition of several processed columns. I wonder whether I need to manage the duplication of the data myself or whether the Hugging Face Hub already does it for me.
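
To make the overlap concrete, here is a minimal sketch of how the shared rows between two splits could be measured locally before uploading. The rows, the column names, and the helpers `row_fingerprint` and `overlap_fraction` are all hypothetical, and this only describes the data, not anything the Hub does internally.

```python
import hashlib
import json


def row_fingerprint(row, columns):
    """Hash only the shared base columns, so added processed columns are ignored."""
    payload = json.dumps({c: row[c] for c in columns}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def overlap_fraction(split_a, split_b, columns):
    """Fraction of distinct rows in split_a that also appear in split_b."""
    a = {row_fingerprint(r, columns) for r in split_a}
    b = {row_fingerprint(r, columns) for r in split_b}
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical toy data: split_2 repeats split_1 and adds a processed column.
split_1 = [
    {"id": 1, "text": "foo"},
    {"id": 2, "text": "bar"},
]
split_2 = [
    {"id": 1, "text": "foo", "n_tokens": 1},
    {"id": 2, "text": "bar", "n_tokens": 1},
    {"id": 3, "text": "baz", "n_tokens": 1},
]

print(overlap_fraction(split_1, split_2, ["id", "text"]))
```

In my case this fraction would be close to 1.0 for most pairs of splits, which is why I'm asking whether the storage side deduplicates.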