Running out of memory during dataset.map() with `AutoFeatureExtractor.from_pretrained("facebook/hubert-large-ls960-ft")`

Hi, I’m running out of memory when trying to use the map function on a Dataset. A simplified, (mostly) reproducible example (on a machine with 16 GB of RAM) is below.

Let’s say I have a dataset of 1000 audio files of varying lengths, from 5 to 20 seconds, all sampled at 16 kHz. Assume I have the following Dataset object representing it:

import numpy as np
from datasets import Dataset

dataset_size = 1000

dummy_data = {
    "audio_file": [
        {
            'path': "ignore-me.wav",
            'array': np.random.random(
                # an audio clip with a random length between 5 s and 20 s
                np.random.randint(5 * 16_000, 20 * 16_000)
            ).astype(np.float32),
            'sampling_rate': 16000
        } for _ in range(dataset_size) 
    ],
    "labels": [np.random.choice(['A', 'B']) for _ in range(dataset_size)],
}
ds = Dataset.from_dict(dummy_data)
del dummy_data

I want to use the "facebook/hubert-large-ls960-ft" feature extractor to preprocess this data set to train a Hubert model. This is my preprocessing code.

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-large-ls960-ft")
def preprocess_function(examples):
    audio_arrays = [x['array'] for x in examples['audio_file']]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=160_000,
        truncation=True,
        padding=True
    )
    return inputs

When I run the following, my machine with 16 GB of RAM crashes (the notebook kernel dies). htop shows the process exhausting all available memory.

ds.map(
    preprocess_function,
    remove_columns=['audio_file'],
    batched=True)

Question:

I believe this behaviour does not align with “The magic of memory mapping” described in the datasets docs. Most likely I’m doing something wrong here. How can I fix it?

PS:

  1. In my actual data set, the audio clips range from 5 seconds to several minutes long, and there are about 15,000 clips in total. I believe that’s quite a small data set compared to other publicly available speech data sets.
  2. For my actual data set, I obtain the variable ds above using the from_pandas method and the datasets.Audio feature to load the files as audio clips.
  3. Versions: transformers=4.18.0, datasets=2.1.0

Thanks!

Hey there, I just ran into this issue when processing images, and found a potential solution in the docs :smile: - maybe it will work for you.

In this section of the docs, it says:

Dataset.map() takes up some memory, but you can reduce its memory requirements with the following parameters:

  • batch_size determines the number of examples that are processed in one call to the transform function.
  • writer_batch_size determines the number of processed examples that are kept in memory before they are stored away.

I wasn’t using batched=True, so I just had to change writer_batch_size to something small (in my case I set it to 10). Since you are using batched=True, you might want to experiment with batch_size as well.


Edit: actually I just found that this does not solve my problem, I was mistaken. :cry:

@lhoestq @mariosasko any tips here?

@nateraw This is actually a duplicate question. See the solution to that question for the fix.

Spoiler: you have to pass cache_file_name in .map().

Oh wow, interesting! I was also using pandas to create the dataset originally, so it was the exact same problem. Thanks for the link :slight_smile: This should solve my issue too.