How to load this simple audio data set and use dataset.map without memory issues?

Thanks @lhoestq! I think there’s something wrong here. I tried a data set of size N=10_000 and it always crashed on Colab (~13 GB RAM), even with batch_size=1.

ds = ds.map(preprocess_function, remove_columns='audio', batch_size=1)

(The code I provided reproduces this on the free version of Colab with N=10000.)

Another observation: memory usage increases roughly linearly while ds.map() is running. Could it be that memory isn’t being garbage collected?
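One way to probe the garbage-collection question is to force a collection after every step and see whether memory still grows. Below is a minimal sketch using only the standard library. It does not call `datasets` at all: `fake_preprocess` and `traced_memory_per_step` are hypothetical names, and the `keep_results` flag merely imitates something holding on to processed rows. If memory keeps climbing even after `gc.collect()`, the objects are probably still referenced somewhere rather than just waiting for the collector.

```python
import gc
import tracemalloc


def fake_preprocess(example):
    # Hypothetical stand-in for preprocess_function: returns a large feature list.
    return {"features": [0.0] * 10_000}


def traced_memory_per_step(fn, n_steps, keep_results):
    """Return traced Python memory (bytes) after each step, measured after gc.collect()."""
    tracemalloc.start()
    kept = []
    usage = []
    for i in range(n_steps):
        out = fn({"id": i})
        if keep_results:
            kept.append(out)  # imitates a buffer that still references the row
        del out
        gc.collect()  # rule out "garbage just hasn't been collected yet"
        current, _peak = tracemalloc.get_traced_memory()
        usage.append(current)
    tracemalloc.stop()
    return usage


retained = traced_memory_per_step(fake_preprocess, 20, keep_results=True)
released = traced_memory_per_step(fake_preprocess, 20, keep_results=False)
print("growth when results are kept:    ", retained[-1] - retained[0])
print("growth when results are released:", released[-1] - released[0])
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, so it may not see memory held by native buffers; watching the process RSS alongside it would give a fuller picture.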
