Dataset map during runtime

Hi! I am currently using the datasets library with the Trainer to fine-tune a pre-trained model. I have a large dataset, and I want to know whether it is possible to execute dataset.map() at runtime instead of transforming all the data at once.

import librosa

def prepare_dataset(batch):
    # batch["audio"] is expected to hold the path to the audio file
    audio = batch["audio"]
    # decode and resample the audio to 16 kHz
    wav, sr = librosa.load(audio, sr=16000)
    # feature_extractor and tokenizer are the ones loaded for the pre-trained model
    batch["input_values"] = feature_extractor(wav, sampling_rate=sr).input_values[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = dataset.map(prepare_dataset, remove_columns=['audio', 'sentence'])

If I run this, all the data is transformed at once, which causes memory issues. Is it possible to run the mapping during training, like a pipeline: take only some rows, map them, train on them, and so on?

Any insights are helpful. Thanks!

What may help:

common_voice = dataset.map(prepare_dataset, remove_columns=['audio', 'sentence'], writer_batch_size=500)

Basically this keeps only half as many processed rows in memory before they are written to the cache file (the default is 1,000 per write). I do this when I want to speed up mapping and use multiple dataloaders without running into OOM: for each additional dataloader I halve writer_batch_size. I also reduce the writer batch size when processing datasets with heavier rows, for example so I can still handle the dataset on my notebook while on the road.
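
If 'multiple dataloaders' here corresponds to parallel map() workers, a minimal sketch of that rule of thumb could look like the following; the num_proc value and the halved writer_batch_size are illustrative assumptions, not values prescribed by the library:

# Hypothetical sketch: parallel mapping while halving writer_batch_size
# once more for the extra worker, following the rule of thumb above.
common_voice = dataset.map(
    prepare_dataset,
    remove_columns=['audio', 'sentence'],
    num_proc=2,              # two worker processes instead of one
    writer_batch_size=250,   # 500 halved again for the additional worker
)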

As long as you have disk-based caching enabled (and don't force the dataset to be kept in memory), this should work for you too.
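
For illustration, this is what 'disk-based caching enabled' means in datasets terms; enable_caching() and keep_in_memory=False are already the defaults, so the sketch below only spells out the assumption rather than adding anything you normally need:

import datasets

# Disk caching is on by default; this call just makes the assumption explicit.
datasets.enable_caching()

common_voice = dataset.map(
    prepare_dataset,
    remove_columns=['audio', 'sentence'],
    writer_batch_size=500,
    keep_in_memory=False,  # default: processed rows are written to the cache file on disk
)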

Note: the correct parameter name is writer_batch_size, not batch_writer_size.