It does create files because it writes the resulting dataset to your disk and reloads it from there using memory mapping (which saves some RAM). To keep your dataset in memory instead, you can pass keep_in_memory=True to map.
When caching is disabled, dataset files are written to temporary directories.
As mentioned above, it creates a lot of cache files at each step. Is there a way to clean up the cache files after each map step?
Also, if we use keep_in_memory=True with num_proc>1, it slows down.
I am using v1.16.1 and have constraints that prevent me from upgrading.
Is there a better way to speed up these preprocessing steps without caching a lot of files (I have limited disk space), or with keep_in_memory=True and num_proc > 1?
If the data is in memory, it currently has to be copied to the subprocesses (or passed to them via the disk). In contrast, data loaded from your disk can be reloaded almost instantly thanks to memory mapping. That’s why starting a parallel map is usually faster with data from your disk.
It should be possible to use a Plasma store instead of copying all the data to each subprocess, but this isn’t implemented.
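To illustrate why memory-mapped data is cheap to reopen, here is a small sketch with Python's standard mmap module. It is only an analogy: datasets relies on Arrow memory mapping rather than this module, and the file name and contents are made up.

```python
import mmap
import os
import tempfile

# Write a small file to stand in for a dataset stored on disk.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"0123456789" * 1000)

# Map the file instead of reading it: no bytes are copied into the process
# up front, and any process can open the same mapping without pickling data.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        size = len(mapped)
        chunk = mapped[5000:5010]  # only the pages touched here are loaded
```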
However, I believe it is worth issuing a warning when cached files are used, because of a scenario I encountered:
There was a bug in my initial process function. After fixing the bug, I found that the program would skip the map step unless I changed the num_proc=4 parameter.
I find this behavior to be unexpected and undesirable. Thanks for considering my viewpoint.