Keeping only current dataset state in cache


How would I apply multiple processing functions to a dataset while only ever keeping one version of the dataset in cache? The code below, for example, seems to keep every state of the dataset in the /tmp directory until the session ends. As a result my dataset is duplicated on disk many times, requiring a ton of disk space.


dataset = load_dataset(...)

# (process_fn1/2/3 stand in for my processing functions)
# Creates first copy in /tmp
dataset = dataset.map(process_fn1)
# Creates second copy in /tmp
dataset = dataset.map(process_fn2)
# Creates third copy in /tmp
dataset = dataset.map(process_fn3)


I don’t care about the intermediate versions. I’m running this as a SageMaker processing job and my whole container is discarded afterwards. All I need is an input and an output version of the dataset. How would I get rid of the additional copies in /tmp so I can reduce my script’s disk-space requirements?


You can get the list of cache files to delete from dataset.cache_files, and delete them from your filesystem at each step:

import os

def delete_dataset(dataset):
    cached_files = [cache_file["filename"] for cache_file in dataset.cache_files]
    del dataset
    for cached_file in cached_files:
        os.remove(cached_file)

# This line creates a new dataset using map, and deletes the old dataset's
# cache files. The right-hand side is evaluated left to right, so the new
# dataset is written before the old one's files are removed.
# (process_fn stands in for your processing function.)
dataset, _ = dataset.map(process_fn), delete_dataset(dataset)
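To see the mechanics of delete_dataset in isolation, here is a self-contained sketch that runs without the datasets library. FakeDataset is a hypothetical stand-in that only mimics the cache_files attribute (a list of dicts with a "filename" key, as exposed by a real Dataset); the temp files play the role of the arrow cache files in /tmp:

```python
import os
import tempfile

# Hypothetical stand-in for datasets.Dataset, exposing only the
# cache_files attribute that delete_dataset relies on.
class FakeDataset:
    def __init__(self, filenames):
        self.cache_files = [{"filename": f} for f in filenames]

def delete_dataset(dataset):
    # Collect the on-disk cache file paths, drop the Python reference,
    # then remove the files themselves.
    cached_files = [cache_file["filename"] for cache_file in dataset.cache_files]
    del dataset
    for cached_file in cached_files:
        os.remove(cached_file)

# Simulate a cached dataset backed by two files in a temp directory.
tmpdir = tempfile.mkdtemp()
shards = [os.path.join(tmpdir, name) for name in ("data-0.arrow", "data-1.arrow")]
for path in shards:
    open(path, "wb").close()

dataset = FakeDataset(shards)
delete_dataset(dataset)
print(all(not os.path.exists(p) for p in shards))  # True: cache files are gone
```

The same call pattern applies after each map step in your pipeline, so at any point only the newest version of the dataset remains on disk.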