Does imagefolder load (and decode) images into memory at setup? If so, can I disable it?
Is there any implicit processing Datasets does when I first call load_dataset that makes it take so long?
What’s the best practice for loading a relatively large dataset? I’ve seen someone mention saving the dataset as Arrow and then loading it, but I don’t know how to do that specifically. There is an urgent need for a detailed tutorial on this in the official docs.
@panigrah thank you very much. Maybe you also know whether it’s possible to download a dataset in a multi-process way? For some reason, setting num_proc does not work at all… My dataset has 58 parquet files, and I was hoping that passing num_proc to load_dataset would spawn 58 Python processes, each downloading its own parquet file, so I could load my dataset in 1 minute instead of 50…