Keeping only current dataset state in cache

Hi,

How would I apply multiple processing functions to a dataset while only ever keeping one version of the dataset in cache? The code below, for example, seems to keep every intermediate state of the dataset in the /tmp directory until the session ends. This results in my dataset being duplicated on disk many times, and thus requiring a ton of disk space.

from datasets import load_dataset, set_caching_enabled

# With caching disabled, each map() result is written to a temporary file in /tmp
set_caching_enabled(False)

dataset = load_dataset(...)

# Creates first copy in /tmp
dataset = dataset.map(...)
# Creates second copy in /tmp
dataset = dataset.map(...)
# Creates third copy in /tmp
dataset = dataset.map(...)

dataset.save_to_disk(...)

I don’t care about the intermediate versions. I’m running this as a SageMaker processing job, and the whole container is discarded afterwards. All I need is an input and an output version of the dataset. How would I get rid of the additional copies in /tmp so I can reduce my script’s disk-space requirements?

Thanks,
Marcel


You can list the cache files to delete via dataset.cache_files. Feel free to delete them from your filesystem at each step:

import os

def delete_dataset(dataset):
    # Collect the paths of the Arrow files backing this dataset
    cached_files = [cache_file["filename"] for cache_file in dataset.cache_files]
    # Drop the reference first so the files are no longer memory-mapped
    del dataset
    for cached_file in cached_files:
        os.remove(cached_file)

# The right-hand side is evaluated left to right: map() builds the new dataset
# first, then delete_dataset() removes the old dataset's files from disk
dataset, _ = dataset.map(...), delete_dataset(dataset)
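
Applied to the pipeline from the question, the pattern could look like the sketch below. The dataset name, map functions, and output path are placeholder examples, and delete_dataset is the helper defined above. Note that the first delete_dataset call also removes the files of the originally loaded dataset, which is fine in a throwaway container:

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")  # placeholder dataset

# Each step builds the next version, then deletes the previous one's files
dataset, _ = dataset.map(lambda x: {"text": x["text"].lower()}), delete_dataset(dataset)
dataset, _ = dataset.map(lambda x: {"n_chars": len(x["text"])}), delete_dataset(dataset)

# Only the final version is persisted
dataset.save_to_disk("/opt/ml/processing/output/dataset")  # placeholder path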

@lhoestq - If we delete the cache files after each step, will it make processing the data with .map() slower compared to not deleting them?

No, it won’t make it slower: it just removes the intermediate files that you don’t need anymore.

Though if you re-run your processing from scratch, it will recompute everything, since the cache files were removed.
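
If you’d rather not track file paths yourself, datasets also provides Dataset.cleanup_cache_files(), which deletes every cache file in the dataset’s cache directory except the one currently in use. A minimal sketch, assuming caching is left enabled so the intermediate files land in the cache directory (the dataset and map function are placeholder examples):

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")  # placeholder dataset

dataset = dataset.map(lambda x: {"text": x["text"].lower()})
# Drop every cache file except the one backing the current dataset
removed = dataset.cleanup_cache_files()
print(f"Removed {removed} intermediate cache file(s)")

As with the helper above, a later re-run will recompute the removed steps from scratch.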