I have a large dataset of over 10,000 audio samples. Whenever I try to map it, it exceeds available memory. I have used batching, but it still fails. Is there a better way to do it?
Hi! You can lower memory usage by reducing `batch_size` in `map` (the default is 1000).
Thanks Mario. I have used that, but it didn't work for me. Any other ideas?
I think the parameter you're looking for is `writer_batch_size`: it's the number of rows per write operation for the cache file writer. The default is currently 1,000. A higher value makes the processing do fewer data flushes; a lower value consumes less temporary memory while running `.map()`.
Feel free to set both `batch_size` and `writer_batch_size` to lower values.
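To see why lowering `writer_batch_size` helps, here is a minimal sketch (not the actual `datasets` internals; the function and variable names are made up for illustration) of the trade-off: rows are buffered in memory and flushed to the cache file every `writer_batch_size` rows, so a smaller value means more flushes but a smaller peak buffer.

```python
def process_rows(n_rows, writer_batch_size):
    """Simulate buffered writing: returns (number of flushes, peak buffer size)."""
    buffer = []
    flushes = 0
    peak = 0
    for i in range(n_rows):
        buffer.append(i)  # stand-in for one processed audio row
        peak = max(peak, len(buffer))
        if len(buffer) >= writer_batch_size:
            flushes += 1  # stand-in for writing the buffer to the cache file
            buffer.clear()
    if buffer:  # flush any remaining rows
        flushes += 1
    return flushes, peak

# Lower writer_batch_size -> more flushes, but a smaller peak buffer:
print(process_rows(10_000, 1_000))  # (10, 1000)
print(process_rows(10_000, 100))    # (100, 100)
```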
cc @mariosasko don't you think we should have `writer_batch_size = batch_size` by default?
@lhoestq Yes, I agree. We can set the default value of `writer_batch_size` to `None` in the signature of `map`, and do `writer_batch_size = batch_size if writer_batch_size is None else writer_batch_size` in the body.
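The proposed defaulting pattern would look something like this (a sketch of the idea only, with a simplified signature; the real `map` takes many more arguments):

```python
def map(batch_size=1000, writer_batch_size=None):
    # If the caller didn't set writer_batch_size explicitly,
    # fall back to the same value as batch_size.
    writer_batch_size = batch_size if writer_batch_size is None else writer_batch_size
    return batch_size, writer_batch_size

print(map(batch_size=200))                        # (200, 200)
print(map(batch_size=200, writer_batch_size=50))  # (200, 50)
```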