Dataset map during runtime

What may help:

```python
common_voice = dataset.map(prepare_dataset, remove_columns=['audio', 'sentence'], writer_batch_size=500)
```

Basically this keeps only half as many processed rows in memory before they are flushed to disk (the default is 1,000 rows per write). I do this when I want to speed up mapping with multiple dataloaders without running into OOM: for each additional dataloader I halve the writer_batch_size. I also reduce it when I process datasets with heavier rows, so I can still handle the dataset, for example on my notebook while travelling.
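As a rough sketch of that rule of thumb (treating the extra loaders as num_proc map workers here, which is my assumption; writer_batch_size_for is just an illustrative helper):

```python
# Halve the default writer_batch_size (1000 rows per cache write)
# once for each worker beyond the first.
def writer_batch_size_for(num_workers: int, base: int = 1000) -> int:
    return max(1, base // (2 ** (num_workers - 1)))

common_voice = dataset.map(
    prepare_dataset,
    remove_columns=['audio', 'sentence'],
    num_proc=4,
    writer_batch_size=writer_batch_size_for(4),  # 1000 // 8 = 125
)
```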

As long as you have disk-based caching enabled (and don't force the dataset to be kept in memory), this should work for you too.
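If you want to double-check that nothing is forcing the dataset into RAM, something like this should confirm it (keep_in_memory=False is already the default for map, and is_caching_enabled() is the library's own caching switch):

```python
import datasets

# Disk-based caching has to be on for writer_batch_size to relieve memory pressure.
print(datasets.is_caching_enabled())  # True by default

common_voice = dataset.map(
    prepare_dataset,
    remove_columns=['audio', 'sentence'],
    writer_batch_size=500,
    keep_in_memory=False,  # the default: write processed batches to an on-disk Arrow cache file
)
print(common_voice.cache_files)  # non-empty list -> the result is backed by files on disk
```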