Mapping large datasets

I have a large dataset of over 10,000 audio samples. Whenever I try to map it, it exceeds the available memory. I have used batching, but it still failed. Is there a better way to do it?

Hi! You can lower memory usage by reducing batch_size in map (default is 1000).
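For example, a minimal sketch (the dataset name and the preprocess function here are just placeholders for your own):

```python
from datasets import load_dataset

# Placeholder dataset and preprocessing function, for illustration only
dataset = load_dataset("my_audio_dataset", split="train")

def preprocess(batch):
    # e.g. derive a feature from each decoded audio sample in the batch
    batch["num_samples"] = [len(audio["array"]) for audio in batch["audio"]]
    return batch

# Smaller batch_size -> fewer rows held in memory per call to preprocess
dataset = dataset.map(preprocess, batched=True, batch_size=100)
```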

Thanks Mario. I have used that, but it didn’t work for me. Any other ideas?

I think the parameter you’re looking for is the writer_batch_size: it’s the number of rows per write operation for the cache file writer.

The default is currently 1,000. A higher value means fewer data flushes during processing, while a lower value consumes less temporary memory while running .map().

Feel free to set both batch_size and writer_batch_size to lower values 🙂
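Something like this, reusing the placeholder preprocess function from the earlier sketch:

```python
# Lower both values: batch_size controls how many rows are passed to the
# function per call, writer_batch_size controls how many processed rows are
# kept in memory before being flushed to the Arrow cache file.
dataset = dataset.map(
    preprocess,
    batched=True,
    batch_size=100,
    writer_batch_size=100,
)
```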

cc @mariosasko don’t you think we should have writer_batch_size = batch_size by default?


@lhoestq Yes, I agree. We can set the default value of writer_batch_size to None in the signature of map, and do:

writer_batch_size = batch_size if writer_batch_size is None else writer_batch_size

in the body.
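For context, with the current defaults a call that only lowers batch_size leaves writer_batch_size at 1,000 unless it is set explicitly; under the proposal the two calls below would behave the same (illustrative sketch, reusing the placeholder preprocess function):

```python
# Today: only batch_size is lowered; writer_batch_size keeps its default of 1000
dataset = dataset.map(preprocess, batched=True, batch_size=100)

# Under the proposal, the call above would behave like this one
dataset = dataset.map(preprocess, batched=True, batch_size=100, writer_batch_size=100)
```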
