Dataset map during runtime

Hi! I am currently using the datasets library with the Trainer to fine-tune a pre-trained model. I have a large dataset, and I want to know whether it is possible to execute dataset.map() at runtime instead of transforming all the data at once.

import librosa

def prepare_dataset(batch):
    # batch["audio"] is expected to hold the path to the audio file
    audio = batch["audio"]
    # decode and resample the audio to 16 kHz
    wav, sr = librosa.load(audio, sr=16000)
    # feature_extractor and tokenizer are the ones loaded for the pre-trained model
    batch["input_values"] = feature_extractor(wav, sampling_rate=sr).input_values[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = dataset.map(prepare_dataset, remove_columns=['audio', 'sentence'])

If I run this, all the data is transformed at once, which causes memory issues. Is it possible to run the mapping during training, like a pipeline: take only some rows, map them, train on them, and so on?

Any insights are helpful. Thanks!

What may help:

common_voice = dataset.map(prepare_dataset, remove_columns=['audio', 'sentence'], writer_batch_size=500)

Basically this keeps only half as many processed rows in memory before they are written to the cache file (the default is 1,000 per write). I do this when I want to speed up mapping and use multiple dataloaders without running into OOM: for each additional dataloader I halve writer_batch_size. I also reduce the writer batch size when processing datasets with heavier rows, for example so I can still handle the dataset on my notebook while on the road.
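
If 'multiple dataloaders' here corresponds to parallel map() workers, a minimal sketch of that rule of thumb could look like the following; the num_proc value and the halved writer_batch_size are illustrative assumptions, not values prescribed by the library:

# Hypothetical sketch: parallel mapping while halving writer_batch_size
# once more for the extra worker, following the rule of thumb above.
common_voice = dataset.map(
    prepare_dataset,
    remove_columns=['audio', 'sentence'],
    num_proc=2,              # two worker processes instead of one
    writer_batch_size=250,   # 500 halved again for the additional worker
)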

As long as you have disk-based caching enabled (and don't force the dataset to be kept in memory), this should work for you too.
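
For illustration, this is what 'disk-based caching enabled' means in datasets terms; enable_caching() and keep_in_memory=False are already the defaults, so the sketch below only spells out the assumption rather than adding anything you normally need:

import datasets

# Disk caching is on by default; this call just makes the assumption explicit.
datasets.enable_caching()

common_voice = dataset.map(
    prepare_dataset,
    remove_columns=['audio', 'sentence'],
    writer_batch_size=500,
    keep_in_memory=False,  # default: processed rows are written to the cache file on disk
)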

Note: the correct parameter name is writer_batch_size, not batch_writer_size.