Use existing Dataset with a generator

I forgot to mention that I was trying to avoid mapping the entire dataset at once because of its large size (170+ GB). That’s what I initially did, but the map takes 2-3 hours at the start of EVERY training run (and inflates everything to 700+ GB), which is infeasible for me.

You can use multiprocessing (the `num_proc` parameter) and disable image decoding with `ds = ds.cast_column("image", Image(decode=False))` (turn it back on later with `ds = ds.cast_column("image", Image())`) to make the processing faster. Also, it’s best to run this transform on the “raw” dataset to allow reloading from the cache on later runs.
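
Here’s a minimal sketch of that flow. It assumes an image dataset loaded with the `imagefolder` builder, and `add_num_bytes` is just a hypothetical stand-in for your actual transform; the point is that with decoding disabled, `map()` sees raw `{"bytes": ..., "path": ...}` dicts instead of PIL images:

```python
from datasets import Image, load_dataset

# Hypothetical example dataset; substitute your own large image dataset
ds = load_dataset("imagefolder", data_dir="path/to/images")["train"]

# Disable decoding: map() now sees {"bytes": ..., "path": ...} dicts
# instead of PIL images, so it skips the expensive decode step
ds = ds.cast_column("image", Image(decode=False))

def add_num_bytes(batch):
    # Hypothetical transform shown only to illustrate the pattern;
    # it reads the raw bytes without ever decoding the images
    batch["num_bytes"] = [
        len(img["bytes"]) if img["bytes"] is not None else 0
        for img in batch["image"]
    ]
    return batch

# num_proc parallelizes across processes; the result is written to the
# cache, so running this on the raw dataset again on a later run
# reloads it instead of recomputing
ds = ds.map(add_num_bytes, batched=True, num_proc=8)

# Turn decoding back on so indexing yields PIL images for training
ds = ds.cast_column("image", Image())
```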

The only on-the-fly, one-to-many-examples option is the one mentioned above, right?

Correct!
