Upload efficiently for lazy split download

Thanks for your answer and the interesting pointers!

I am currently using the ImageFolder structure, but:

  • I cannot get it to work with a “calibration” split name
  • It’s extremely slow to download since it loads files one by one (1h20 yesterday when I tried to download it all)
  • It does not allow custom split strategies (like leave_out="cat" I mentioned)
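
To make the last point concrete, here is a minimal sketch of the kind of split I have in mind. The `leave_out` parameter, the `split_leave_out` helper and the sample records are all hypothetical names of mine; none of this is an existing `datasets` API, it only illustrates the behaviour I would like.

```python
from typing import Dict, List, Optional

# Hypothetical records: one dict per image, similar to what an
# ImageFolder-style loader might yield (file path + class label).
Record = Dict[str, str]


def split_leave_out(records: List[Record], leave_out: Optional[str] = None) -> List[Record]:
    """Return the records whose label differs from `leave_out`.

    With leave_out=None the full list is returned unchanged (as a copy).
    """
    if leave_out is None:
        return list(records)
    return [r for r in records if r["label"] != leave_out]


# Placeholder data, only for illustration.
records = [
    {"file": "cat/001.png", "label": "cat"},
    {"file": "dog/001.png", "label": "dog"},
    {"file": "dog/002.png", "label": "dog"},
]

# Keep everything except the "cat" class.
kept = split_leave_out(records, leave_out="cat")
```

Ideally this filtering would happen before download, so that the left-out class is never fetched at all.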

By the way, since executing the dataset builder directly from Hub is no longer recommended,

Hmmm that’s a bummer.

it might be more convenient to publish the built data set if you want to make it public.

Could you explain what you mean by “built”, please? When I browse other datasets, they never upload raw files like I did (it seemed naive to do so, so I expected that). They often use Parquet, which I don’t think is very appropriate for images; maybe zip archives would be better? Is that what you mean?

Or do you mean “built” as in “publish it 11 times with 11 strategies in 11 folders (entire dataset + 10 times minus one class)”?
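
If it is the second interpretation, I understand it as something like the sketch below: one full copy plus one copy per left-out class. The `list_variants` helper, the variant names and the class names are placeholders of mine, not anything the Hub requires; with my 10 classes this would give the 11 folders.

```python
from typing import List, Optional, Tuple


def list_variants(class_names: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Return (variant_name, left_out_class) pairs.

    The first entry is the full dataset (nothing left out), followed by
    one variant per class with that class removed.
    """
    variants: List[Tuple[str, Optional[str]]] = [("all", None)]
    for name in class_names:
        variants.append((f"leave_out_{name}", name))
    return variants


# Placeholder class names, only for illustration.
class_names = ["cat", "dog", "bird"]
variants = list_variants(class_names)
```

That would mean storing almost every image 10 times, which is what makes me doubt this is what you meant.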

All the best.
