Thanks @lhoestq! I think there's something wrong here. I've tried with a dataset size of N=10_000 and it always crashed on Colab (~13 GB RAM), even with `batch_size=1`:
```python
ds = ds.map(preprocess_function, remove_columns='audio', batch_size=1)
```
(The code I provided is reproducible in the free Colab tier with N=10_000.)
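One thing that might be worth checking is whether the growth comes from the cache writer buffer rather than the preprocessing itself. A sketch of what I mean (`writer_batch_size` is the `map()` parameter that controls how many processed rows are buffered in RAM before being flushed to the Arrow cache file; 100 here is an arbitrary value):

```python
# Sketch: lower writer_batch_size so fewer processed rows are held in RAM
# before being flushed to the on-disk Arrow cache (the default is 1000).
ds = ds.map(
    preprocess_function,
    remove_columns='audio',
    batch_size=1,
    writer_batch_size=100,  # arbitrary; smaller = less RAM, more disk I/O
)
```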
Another observation I've made is that memory usage grows roughly linearly while `ds.map()` is running. Could it be that the processed batches aren't being garbage collected?
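To make that observation concrete, here is roughly the check I mean, a sketch that logs the process RSS from inside the map call (`preprocess_with_logging` is just a hypothetical wrapper around the original `preprocess_function`, and `psutil` needs to be installed):

```python
import gc
import os

import psutil

process = psutil.Process(os.getpid())

def preprocess_with_logging(example):
    # Hypothetical wrapper around the original preprocess_function,
    # added only to watch memory while map() runs.
    out = preprocess_function(example)
    gc.collect()  # force a collection to see whether memory is actually freed
    print(f"RSS: {process.memory_info().rss / 1024**2:.0f} MB")
    return out

ds = ds.map(preprocess_with_logging, remove_columns='audio', batch_size=1)
```

If the RSS keeps climbing even with the explicit `gc.collect()`, the retained memory is probably held by something other than ordinary Python garbage.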