I am applying multiple .map() operations, as below:
ds = ds.map(preprocess1, batched=True, num_proc=8)
ds = ds.map(preprocess2, batched=True, num_proc=8)
ds = ds.map(preprocess3, batched=True, num_proc=8)
ds = ds.map(preprocess4, batched=True, num_proc=8)
As mentioned above, this creates a lot of cache files at each step. Is there a way to clean up the cache files after each map step?
Also, using keep_in_memory=True with num_proc > 1 slows processing down.
I am using v1.16.1 and have constraints that prevent me from upgrading.
Is there a better way to speed up these preprocessing steps without caching a lot of files (disk space is limited), or a way to make keep_in_memory=True work well with num_proc > 1?
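
One direction that might reduce the number of cache files, sketched under assumptions: fuse the preprocessing functions into a single batched function, so only one .map() call is needed. The helper name `compose_batched` and the two toy functions are hypothetical stand-ins, not part of the datasets library. The sketch assumes each function takes a batch as a dict of column lists, returns a dict of columns to add or overwrite, and does not change the number of rows.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List]


def compose_batched(*fns: Callable[[Batch], Batch]) -> Callable[[Batch], Batch]:
    """Chain batched functions into one (hypothetical helper, not a datasets API)."""

    def fused(batch: Batch) -> Batch:
        batch = dict(batch)  # shallow copy so the caller's batch is not mutated
        for fn in fns:
            # Merge each step's output into the batch, so later steps see the
            # columns that earlier steps added or overwrote.
            batch.update(fn(batch))
        return batch

    return fused


# Toy stand-ins for the real preprocess functions (assumed, for illustration only).
def lowercase(batch: Batch) -> Batch:
    return {"text": [t.lower() for t in batch["text"]]}


def add_length(batch: Batch) -> Batch:
    return {"n_chars": [len(t) for t in batch["text"]]}


fused = compose_batched(lowercase, add_length)
result = fused({"text": ["Hello", "World!"]})
print(result)
```

With the real functions, this would presumably be used as ds = ds.map(compose_batched(preprocess1, preprocess2, preprocess3, preprocess4), batched=True, num_proc=8), giving one round of cache files instead of four. It would not apply as written if any step filters rows or relies on remove_columns between steps.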