Dataset map() creates lot of cache files

I am loading a csv file(about 1.5G) from disk using load_dataset(). It creates files under cache directory.

dataset = load_dataset('csv', data_files=filepath)

When we apply map functions on the datasets like below, the cache size keeps growing

df = df.map(preprocess_1, num_proc=8)
df = df.map(preprocess_2, num_proc=8)

Is there a way to disable caching for each map() call?

I tried to disable caching at the datasets level using the following, but it still creates cache files.

from datasets import disable_caching
disable_caching()

Is there any solution/workaround to disable caching?

It does create files because map() writes the resulting dataset to your disk so it can be reloaded from there using memory mapping (which saves RAM). To keep your dataset in memory instead, you can pass keep_in_memory=True to map().

When caching is disabled, dataset files are written to temporary directories.


Thanks, @lhoestq


If I am applying multiple .map() operations, as below:

ds = ds.map(preprocess1, batched=True, num_proc=8)
ds = ds.map(preprocess2, batched=True, num_proc=8)
ds = ds.map(preprocess3, batched=True, num_proc=8)
ds = ds.map(preprocess4, batched=True, num_proc=8)

As mentioned above, this creates a lot of cache files at each step. Is there a way to clean up the cache files after each map step?

Also, if we use keep_in_memory=True with num_proc>1, it slows down.

I am using v1.16.1 and have constraints that prevent me from upgrading.

Is there a better way to speed up these preprocessing steps without caching a lot of files (I have a constraint on disk space), or with keep_in_memory=True and num_proc > 1?


If the data is in memory, it currently needs to be copied to the subprocesses (or written to disk to do so). In contrast, data loaded from your disk can be reloaded instantaneously thanks to memory mapping. That's why starting a parallel map is usually faster with data on disk.

It should be possible to use a Plasma store instead of copying all the data to each subprocess, but this isn't implemented.

Hi, @lhoestq,

Thanks for your impressive caching mechanisms.

However, I believe it is worth issuing a warning when cached files are reused, due to a scenario I encountered:

My initial preprocessing function had a bug. After fixing the bug, I found that the program would skip the map step and reuse the stale cached result unless I changed the num_proc=4 parameter.

I find this behavior to be unexpected and undesirable. Thanks for considering my viewpoint.