I have an image dataset with multiple splits. Unlike normal train/val splits, they are supersets of each other: e.g. split A contains all images from split B plus some additional ones, split C contains all images from split A, and so on.
The dataset is loaded from multiple JSON files; each file contains references to the images (as a feature column). Across all splits, a reference to the same image always uses the same path.
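To make the setup concrete, here is a minimal sketch (the paths and split contents are made up for illustration):

```python
# Split B is the smallest; A is a superset of B; C is a superset of A.
split_b = ["images/cat.jpg", "images/dog.jpg"]
split_a = split_b + ["images/bird.jpg"]   # A contains everything in B
split_c = split_a + ["images/fish.jpg"]   # C contains everything in A

# The same image is always referenced by the same path across splits,
# so the number of distinct files is much smaller than the total
# number of references.
total_refs = len(split_a) + len(split_b) + len(split_c)
unique_files = set(split_a) | set(split_b) | set(split_c)
print(total_refs, len(unique_files))  # 9 references, but only 4 distinct images
```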
My question is: when uploading to the Hub, will this upload the same images multiple times? What I am observing now is that it creates Parquet files per split and then uploads those, which means all the images would be duplicated, because their bytes are embedded in the Parquet files and are no longer isolated files.
Is there a way to disable this behaviour, or to prevent it from duplicating the images in the Parquet files?