Keeping only current dataset state in cache

Hi,

How would I apply multiple processing functions to a dataset while only ever keeping one version of the dataset in cache? The code below, for example, seems to keep every intermediate state of the dataset in the /tmp directory until the session ends. This results in my dataset being duplicated on disk many times, and thus requiring a ton of disk space.

from datasets import load_dataset, set_caching_enabled

# With caching disabled, each map() result is written to a temporary file in /tmp
set_caching_enabled(False)

dataset = load_dataset(...)

# Creates first copy in /tmp
dataset = dataset.map(...)
# Creates second copy in /tmp
dataset = dataset.map(...)
# Creates third copy in /tmp
dataset = dataset.map(...)

dataset.save_to_disk(...)

I don’t care about the intermediate versions. I’m running this as a SageMaker processing job, and the whole container is discarded afterwards. All I need is an input and an output version of the dataset. How would I get rid of the additional copies in /tmp so I can reduce my script’s disk-space requirements?

Thanks,
Marcel


You can list the cache files to delete via dataset.cache_files. Feel free to delete them from your filesystem at each step:

import os

def delete_dataset(dataset):
    # Collect the paths of the Arrow files backing this dataset
    cached_files = [cache_file["filename"] for cache_file in dataset.cache_files]
    # Drop the reference first so the files are no longer memory-mapped
    del dataset
    for cached_file in cached_files:
        os.remove(cached_file)

# The right-hand side is evaluated left to right: map() builds the new dataset
# first, then delete_dataset() removes the old dataset's files from disk
dataset, _ = dataset.map(...), delete_dataset(dataset)
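
Applied to the pipeline from the question, the pattern could look like the sketch below. The dataset name, map functions, and output path are placeholder examples, and delete_dataset is the helper defined above. Note that the first delete_dataset call also removes the files of the originally loaded dataset, which is fine in a throwaway container:

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")  # placeholder dataset

# Each step builds the next version, then deletes the previous one's files
dataset, _ = dataset.map(lambda x: {"text": x["text"].lower()}), delete_dataset(dataset)
dataset, _ = dataset.map(lambda x: {"n_chars": len(x["text"])}), delete_dataset(dataset)

# Only the final version is persisted
dataset.save_to_disk("/opt/ml/processing/output/dataset")  # placeholder path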

@lhoestq - If we delete the cache files after each step, will it make processing the data with .map() slower compared to not deleting them?

No, it won’t make it slower: it just removes the intermediate files that you don’t need anymore.

Though if you re-run your processing from scratch, it will recompute everything, since the cache files were removed.
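
If you’d rather not track file paths yourself, datasets also provides Dataset.cleanup_cache_files(), which deletes every cache file in the dataset’s cache directory except the one currently in use. A minimal sketch, assuming caching is left enabled so the intermediate files land in the cache directory (the dataset and map function are placeholder examples):

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")  # placeholder dataset

dataset = dataset.map(lambda x: {"text": x["text"].lower()})
# Drop every cache file except the one backing the current dataset
removed = dataset.cleanup_cache_files()
print(f"Removed {removed} intermediate cache file(s)")

As with the helper above, a later re-run will recompute the removed steps from scratch.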