If you are primarily concerned with preventing duplication, it may be better to save files by URL or file name, but this may not be very convenient for large datasets. @lhoestq
I have an image dataset with multiple splits. Unlike normal train/val splits, they are supersets of each other. E.g. split A contains all images from split B and some additional ones, then split C contains all images from split A etc.
The dataset is loaded from multiple JSON files; each file contains references to the images (as a feature column). Across all splits, references to the same image always use the same path.
My question is, when uploading to the hub, will this upload the same i…
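For a setup like the one described above, a quick way to see how much overlap there is between the splits is to compare the number of image references with the number of distinct paths. This is only a minimal sketch: the column name `image`, the split names, and the example paths are hypothetical, and the rows are assumed to be already parsed from the JSON files (e.g. with `json.load`). It says nothing about what the Hub does on upload; it only measures the overlap locally.

```python
from typing import Dict, List, Set

# Hypothetical rows, as they might look after parsing each split's JSON file.
# Split A contains everything in B, and split C contains everything in A.
splits: Dict[str, List[dict]] = {
    "B": [{"image": "img/1.jpg"}],
    "A": [{"image": "img/1.jpg"}, {"image": "img/2.jpg"}],
    "C": [{"image": "img/1.jpg"}, {"image": "img/2.jpg"}, {"image": "img/3.jpg"}],
}


def unique_image_paths(
    splits: Dict[str, List[dict]], column: str = "image"
) -> Dict[str, Set[str]]:
    """Collect the set of image paths referenced by each split."""
    return {name: {row[column] for row in rows} for name, rows in splits.items()}


def duplication_report(
    splits: Dict[str, List[dict]], column: str = "image"
) -> Dict[str, int]:
    """Count per-split references versus distinct paths over all splits."""
    paths = unique_image_paths(splits, column)
    references = sum(len(p) for p in paths.values())
    distinct = len(set().union(*paths.values()))
    return {"references": references, "distinct": distinct}


paths = unique_image_paths(splits)
report = duplication_report(splits)
print(report)
```

If `references` is much larger than `distinct`, most images are shared between splits, which is the case where duplicated storage would matter most.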
Hi Bert, thanks for reaching out, and good job with segments.ai!
You mentioned three different ways of hosting an image dataset, and they’re all sensible ideas. Here are a few aspects that can help you decide which one is best for your case:
Storing the URLs. This has several disadvantages: it is less convenient, less reproducible, and probably won’t work in the long run. This should be avoided as much as possible IMO. However for certain datasets with copyright/licensing issues this ca…