I forgot to mention that I was trying to avoid mapping the entire dataset at once because of its large size (170+ GB). That’s what I initially did, but the map takes 2-3 hours at the start of EVERY training run (and inflates the dataset to 700+ GB), which is infeasible for me.
You can use multiprocessing (the `num_proc` parameter) and disable image decoding with `ds = ds.cast_column("image", Image(decode=False))` (turn it back on later with `ds = ds.cast_column("image", Image())`) to make the processing faster. Also, it’s best to run this transform on the “raw” dataset so that later runs can reload the result from the cache.
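For reference, here is a minimal sketch of that pattern; the dataset name, the transform body, and the `num_proc` value are placeholders you’d swap for your own:

```python
from datasets import load_dataset, Image

# Placeholder dataset; use your own ~170GB image dataset here.
ds = load_dataset("your/dataset", split="train")

# Disable decoding so map() passes around raw bytes instead of decoded PIL images.
ds = ds.cast_column("image", Image(decode=False))

def transform(batch):
    # ... your actual preprocessing on the raw bytes / metadata ...
    return batch

# batched + num_proc speeds things up; the result is cached on disk,
# so later runs reload it instead of recomputing.
ds = ds.map(transform, batched=True, num_proc=8)

# Turn decoding back on once the heavy processing is done.
ds = ds.cast_column("image", Image())
```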
The only on-the-fly, one-to-many-examples option is the one mentioned above, right?
Correct!