Hi,
How can I apply multiple processing functions to a dataset while only ever keeping one version of the dataset in the cache? The code below, for example, seems to keep every intermediate state of the dataset in the /tmp directory until the session ends. As a result, my dataset is duplicated on disk many times, which requires a lot of disk space.
from datasets import load_dataset, set_caching_enabled
set_caching_enabled(False)
dataset = load_dataset(...)
# Creates first copy in /tmp
dataset = dataset.map(...)
# Creates second copy in /tmp
dataset = dataset.map(...)
# Creates third copy in /tmp
dataset = dataset.map(...)
dataset.save_to_disk(...)
I don’t care about the intermediate versions. I’m running this as a SageMaker processing job, and the whole container is discarded afterwards. All I need is an input and an output version of the dataset. How can I get rid of the additional copies in /tmp so I can reduce my script’s space requirements?
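For illustration, one workaround that might apply here is to chain the processing functions into a single function, so that only one map pass (and so only one processed copy) is needed. This is only a sketch under assumptions: it presumes each step is a per-example function that takes a dict and returns a dict of new or updated columns, and the names compose_steps, add_length and lowercase are hypothetical stand-ins, not part of the datasets library or of my actual script.

```python
from typing import Any, Callable, Dict

Example = Dict[str, Any]
Step = Callable[[Example], Example]


def compose_steps(*steps: Step) -> Step:
    """Chain per-example functions into one function (hypothetical helper)."""

    def combined(example: Example) -> Example:
        current = dict(example)  # copy so the caller's dict is not mutated
        for step in steps:
            # Merge the columns returned by this step into the running example
            current.update(step(current))
        return current

    return combined


# Hypothetical stand-ins for the real processing functions
def add_length(example: Example) -> Example:
    return {"length": len(example["text"])}


def lowercase(example: Example) -> Example:
    return {"text": example["text"].lower()}


process = compose_steps(add_length, lowercase)
# Assumed usage: one pass instead of three separate map calls
# dataset = dataset.map(process)
```

This would not help if the steps cannot be expressed as per-example functions, for instance if one of them filters rows or needs a different batching setup.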
Thanks,
Marcel