Running out of memory during dataset.map() with `AutoFeatureExtractor.from_pretrained("facebook/hubert-large-ls960-ft")`

Hi, I’m running out of memory when trying to use the map function on a Dataset. A simplified, (mostly) reproducible example (on a machine with 16 GB of RAM) is below.

Let’s say I have a dataset of 1000 audio files of varying lengths, from 5 to 20 seconds, all sampled at 16 kHz. Assume I have the following Dataset object representing it:

import numpy as np
from datasets import Dataset

dataset_size = 1000

dummy_data = {
    "audio_file": [
        {
            'path': "ignore-me.wav",
            'array': np.random.random(
                # an audio clip with a random length between 5 s and 20 s
                np.random.randint(5 * 16_000, 20 * 16_000)
            ).astype(np.float32),
            'sampling_rate': 16000
        } for _ in range(dataset_size) 
    ],
    "labels": [np.random.choice(['A', 'B']) for _ in range(dataset_size)],
}
ds = Dataset.from_dict(dummy_data)
del dummy_data

I want to use the "facebook/hubert-large-ls960-ft" feature extractor to preprocess this data set to train a Hubert model. This is my preprocessing code.

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-large-ls960-ft")
def preprocess_function(examples):
    audio_arrays = [x['array'] for x in examples['audio_file']]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=160_000,
        truncation=True,
        padding=True
    )
    return inputs

When I run the following, my machine with 16 GB of RAM crashes (the notebook kernel dies). htop shows the process exhausting all available memory.

ds.map(
    preprocess_function,
    remove_columns=['audio_file'],
    batched=True)

Question:

I believe this behaviour does not align with “The magic of memory mapping” described in the datasets docs. Most likely I’m doing something wrong here. How can I fix it?

PS:

  1. In my actual data set, the audio clips range from 5 seconds to several minutes long, and there are about 15,000 clips in total. I believe that’s quite a small data set compared to other publicly available speech data sets.
  2. For my actual data set, I obtain the variable ds above using the from_pandas method and the datasets.Audio feature to load the files as audio clips.
  3. Versions: transformers=4.18.0, datasets=2.1.0

Thanks!

Hey there, I just ran into this issue when processing images, and found a potential solution in the docs :smile: - maybe it will work for you.

In this section of the docs, it says:

Dataset.map() takes up some memory, but you can reduce its memory requirements with the following parameters:

  • batch_size determines the number of examples that are processed in one call to the transform function.
  • writer_batch_size determines the number of processed examples that are kept in memory before they are stored away.

I wasn’t using batched=True, so I just had to change writer_batch_size to something small (in my case I set it to 10). Since you are using batched=True, you might want to experiment with batch_size as well.


Edit: actually I just found that this does not solve my problem, I was mistaken. :cry:

@lhoestq @mariosasko any tips here?

@nateraw This is actually a duplicate question. See the solution to that question for the fix.

Spoiler: you have to pass cache_file_name in .map().

Oh wow, interesting! I was also using pandas to create the dataset originally, so it was the exact same problem. Thanks for the link :slight_smile: This should solve my issue too.