Dataset map during runtime

What may help:

```python
common_voice = dataset.map(prepare_dataset, remove_columns=['audio', 'sentence'], writer_batch_size=500)
```

Basically this keeps only half as many processed rows in memory before they are flushed to disk (the default is 1,000 rows per write). I do this when I want to speed up mapping with multiple dataloaders without running into OOM: for each additional dataloader I halve the writer_batch_size. I also reduce it when I process datasets with heavier rows, so I can still handle the dataset, for example on my notebook while travelling.
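As a rough sketch of that rule of thumb (treating the extra loaders as num_proc map workers here, which is my assumption; writer_batch_size_for is just an illustrative helper):

```python
# Halve the default writer_batch_size (1000 rows per cache write)
# once for each worker beyond the first.
def writer_batch_size_for(num_workers: int, base: int = 1000) -> int:
    return max(1, base // (2 ** (num_workers - 1)))

common_voice = dataset.map(
    prepare_dataset,
    remove_columns=['audio', 'sentence'],
    num_proc=4,
    writer_batch_size=writer_batch_size_for(4),  # 1000 // 8 = 125
)
```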

As long as you have disk-based caching enabled (and don't force the dataset to be kept in memory), this should work for you too.
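If you want to double-check that nothing is forcing the dataset into RAM, something like this should confirm it (keep_in_memory=False is already the default for map, and is_caching_enabled() is the library's own caching switch):

```python
import datasets

# Disk-based caching has to be on for writer_batch_size to relieve memory pressure.
print(datasets.is_caching_enabled())  # True by default

common_voice = dataset.map(
    prepare_dataset,
    remove_columns=['audio', 'sentence'],
    writer_batch_size=500,
    keep_in_memory=False,  # the default: write processed batches to an on-disk Arrow cache file
)
print(common_voice.cache_files)  # non-empty list -> the result is backed by files on disk
```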