Dataset map() creates lot of cache files

sriniv · July 22, 2022, 12:03pm

I am loading a CSV file (about 1.5 GB) from disk using load_dataset(). It creates files under the cache directory.

dataset = load_dataset('csv', data_files=filepath)

When we apply map functions on the dataset like below, the cache size keeps growing:

dataset = dataset.map(preprocess_1, num_proc=8)
dataset = dataset.map(preprocess_2, num_proc=8)

Is there a way to disable caching for each map() call?

I tried to disable caching at the datasets level using the following, but it still creates cache files.

from datasets import disable_caching
disable_caching()

Is there any solution/workaround to disable caching?

lhoestq · July 27, 2022, 9:29am

It does create files because it writes the resulting dataset to your disk so that it can be reloaded from there using memory mapping (which saves RAM). To keep your dataset in memory instead, you can pass keep_in_memory=True to map().

When caching is disabled, dataset files are written to temporary directories.

sriniv · August 2, 2022, 7:37am

Thanks, @lhoestq

sriniv · August 29, 2022, 11:05am

@lhoestq

If I am applying multiple .map() operations… as in below

ds = ds.map(preprocess1, batched=True, num_proc=8)
ds = ds.map(preprocess2, batched=True, num_proc=8)
ds = ds.map(preprocess3, batched=True, num_proc=8)
ds = ds.map(preprocess4, batched=True, num_proc=8)

As mentioned above, it creates a lot of cache files at each step. Is there a way to clean up the cache files after each map step?

Also, if we use keep_in_memory=True with num_proc>1, it slows down.

I am using v1.16.1 and I have certain constraints to upgrade.

Is there a better way to speed up these preprocessing steps without caching a lot of files (I have a constraint on disk space), or with keep_in_memory=True and num_proc>1?

sriniv · September 14, 2022, 5:17pm

@lhoestq -
If we use keep_in_memory=True with num_proc>1, it slows down.

I am using v1.16.1 and I have certain constraints to upgrade.

Is there a better way to speed up these preprocessing steps with keep_in_memory=True and num_proc > 1 ?

lhoestq · September 15, 2022, 12:41pm

If the data is in memory, then it currently needs to be copied to the subprocesses (or passed to them via the disk). In contrast, data loaded from your disk can be reloaded almost instantaneously thanks to memory mapping. That’s why starting a parallel map is usually faster with data from your disk.

It should be possible to use a Plasma store instead of copying all the data to each subprocess, but this isn’t implemented.
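
To illustrate why memory-mapped data is cheap to reopen, here is a standard-library analogy using Python's `mmap` module. It is not what `datasets` does internally (it memory-maps Arrow files); the temporary file simply stands in for a cache file.

```python
import mmap
import os
import tempfile

# Write some bytes to a temporary file, standing in for an Arrow cache file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"0123456789" * 1000)

# Map the file read-only. The OS loads pages lazily on access, and several
# processes mapping the same file can share those pages instead of each
# holding its own copy.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = mm[:10]  # reads only the start of the file
        size = len(mm)

os.remove(path)
```

Opening the mapping costs about the same regardless of file size, whereas an in-memory object has to be serialized and sent to every worker process.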

ShoufaChen · March 26, 2024, 6:25am

Hi, @lhoestq,

Thanks for your impressive caching mechanisms.

However, I believe it is worth issuing a warning when we use cached files due to a scenario I encountered:

My initial processing function had a bug. After fixing it, I found that the program would skip the map step and reuse the cached result unless I changed the num_proc=4 parameter.

I find this behavior to be unexpected and undesirable. Thanks for considering my viewpoint.
