How would I apply multiple processing functions to a dataset while only ever keeping one version of the dataset in the cache? The code below, for example, seems to keep every intermediate state of the dataset in the /tmp directory until the session ends. As a result, my dataset is duplicated on disk many times, which requires a lot of disk space.
    from datasets import load_dataset, set_caching_enabled

    set_caching_enabled(False)
    dataset = load_dataset(...)  # Creates first copy in /tmp
    dataset = dataset.map(...)  # Creates second copy in /tmp
    dataset = dataset.map(...)  # Creates third copy in /tmp
    dataset = dataset.map(...)
    dataset.save_to_disk(...)
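One way to confirm the growth is to measure the size of /tmp after each step. The following is only a minimal sketch using the standard library; `dir_size_bytes` is a hypothetical helper introduced here for illustration and is not part of the `datasets` library.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked files are not counted.
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # The file may have been removed while walking; ignore it.
                pass
    return total
```

Calling `dir_size_bytes("/tmp")` after `load_dataset(...)` and after each `map(...)` call should show whether every step adds roughly another copy of the dataset.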
I don’t care about the intermediate versions. I’m running this as a SageMaker processing job, and my whole container is discarded afterwards. All I need is an input and an output version of the dataset. How would I get rid of the additional copies in /tmp so I can reduce my scripts’ space requirements?