Dataset map() creates lot of cache files

I am loading a csv file(about 1.5G) from disk using load_dataset(). It creates files under cache directory.

dataset = load_dataset('csv', data_files=filepath)

When we apply map functions on the datasets like below, the cache size keeps growing

df = df.map(preprocess_1, num_proc=8)
df = df.map(preprocess_2, num_proc=8)

Is there a way to disable caching for each map() call?

I tried to disable caching at the datasets level using the following, but it still creates cache files.

from datasets import disable_caching
disable_caching()

Is there any solution/workaround to disable caching?

It does create files because map() writes the resulting dataset to your disk so it can be reloaded from there using memory mapping (which saves RAM). To keep your dataset in memory instead, you can pass keep_in_memory=True to map().

When caching is disabled, dataset files are written to temporary directories.


Thanks, @lhoestq


If I am applying multiple .map() operations, as below:

ds = ds.map(preprocess1, batched=True, num_proc=8)
ds = ds.map(preprocess2, batched=True, num_proc=8)
ds = ds.map(preprocess3, batched=True, num_proc=8)
ds = ds.map(preprocess4, batched=True, num_proc=8)

As mentioned above, this creates a lot of cache files at each step. Is there a way to clean up the cache files after each map step?

Also, if we use keep_in_memory=True with num_proc>1, it slows down.

I am using v1.16.1 and have constraints that prevent me from upgrading.

Is there a better way to speed up these preprocessing steps without caching a lot of files (I have a constraint on disk space), or with keep_in_memory=True and num_proc > 1?


If the data is in memory, it currently needs to be copied to the subprocesses (or written to disk to do so). In contrast, data loaded from your disk can be reloaded instantaneously thanks to memory mapping. That's why starting a parallel map is usually faster with data on disk.

It should be possible to use a Plasma store instead of copying all the data to each subprocess, but this isn't implemented.

Hi, @lhoestq,

Thanks for your impressive caching mechanisms.

However, I believe it is worth issuing a warning when cached files are reused, due to a scenario I encountered:

My initial preprocessing function had a bug. After fixing the bug, I found that the program would skip the map step and reuse the stale cached result unless I changed the num_proc=4 parameter.

I find this behavior to be unexpected and undesirable. Thanks for considering my viewpoint.