How to load this simple audio data set and use dataset.map without memory issues?

Hi, I have an audio data set in the following format: 16 kHz audio files in a folder (audio_dir in the example code below) and a pandas DataFrame labels that maps each audio file to its label.

(Code to create this data set is at the end of this post)

(screenshot of the labels DataFrame)

Question:

What is the standard way to create a Dataset from this data to train an audio classification model?
More specifically, how can I use the facebook/hubert-large-ls960-ft feature extractor to create a Dataset for training a HuBERT model? I have the additional requirement of truncating/padding inputs to 10 seconds, which I've done in preprocess_function below.

What I tried:

import numpy as np
import os
import pandas as pd
import soundfile as sf

from datasets import Dataset, Audio
from transformers import Wav2Vec2Processor


# creating the dataset from pandas
ds = Dataset.from_pandas(labels)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# feature extractor
feature_extractor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")

def preprocess_function(examples):
    audio_arrays = [examples['audio']['array']]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16_000,
        max_length=int(16_000 * 10),  # 10 s
        truncation=True,              # cut longer clips down to 10 s
        padding="max_length",         # pad shorter clips up to 10 s
    )
    return inputs

# map the preprocessing function
ds = ds.map(preprocess_function, remove_columns='audio')

This works fine when the data set is small, but it fails when there are many audio files (N ≈ 10,000) because the map operation exhausts memory. I'm probably doing something wrong, because this clearly doesn't align with The magic of memory mapping. What am I doing wrong? Thanks!

Code to create the data set:

# number of examples
N = 10

# labels file
labels = pd.DataFrame({
    'audio': [os.path.join('audio_dir', f"{i}.wav") for i in range(N)],
    'label': np.random.choice(['A', 'B'], N)
})

# save dummy audio files
os.makedirs("audio_dir", exist_ok=True)
for file_path in labels['audio']:
    dummy_audio = np.random.randn(np.random.randint(80_000, 240_000))  # between 5 s and 15 s long
    sf.write(file_path, dummy_audio, 16_000)

Hi! This is a good way to define a dataset for audio classification :slight_smile:

During map, only one batch at a time is loaded in memory and passed to your preprocess_function. To use less memory, you can try reducing writer_batch_size (the default is 1,000) :wink:

ds = ds.map(preprocess_function, remove_columns='audio', writer_batch_size=100)

EDIT: changed batch_size to writer_batch_size

Thanks @lhoestq! I think there's something wrong here. I've tried with a data set size of N=10_000 and it always crashes on Colab (~13 GB RAM), even with batch_size=1.

ds = ds.map(preprocess_function, remove_columns='audio', batch_size=1)

(The code I provided is reproducible in the free Colab tier with N=10000.)

Another observation I've made is that memory usage increases roughly linearly while ds.map() runs. Could it be that it's not garbage collecting?

(plot of memory usage during ds.map())

Actually, it's writer_batch_size that you have to set, sorry.

(batch_size is for batched map, i.e. when you want your function to take several examples at once.)
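
For reference, here's a rough sketch of what the batched variant could look like, reusing the feature_extractor defined above (the parameter values are just illustrative):

def preprocess_function_batched(examples):
    # with batched=True, examples["audio"] is a list of decoded audio dicts
    audio_arrays = [audio["array"] for audio in examples["audio"]]
    return feature_extractor(
        audio_arrays,
        sampling_rate=16_000,
        max_length=int(16_000 * 10),
        truncation=True,
    )

ds = ds.map(
    preprocess_function_batched,
    batched=True,
    batch_size=8,            # how many examples the function receives at once
    writer_batch_size=100,   # how many processed examples are kept in RAM before flushing
    remove_columns="audio",
)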

Let me know if that helps; otherwise this could be an issue with audio files not being closed correctly.

Thanks @lhoestq, unfortunately it's the same even when I try the smallest possible values (for N=10000). Could it be that I'm making a mistake somewhere else in my code (i.e. in the minimal example I provided)?

ds = ds.map(
    preprocess_function,
    remove_columns='audio',
    batch_size=1,
    writer_batch_size=1
)

No, it looks fine (no need to pass batch_size though, since batched=False by default).

Are you running on Colab?

Maybe related to `load_dataset` consumes too much memory · Issue #4057 · huggingface/datasets · GitHub

I just created this reproducible example for Colab, but I get the same issue with a larger data set on another machine with 16 GB RAM. I'd think 16 GB should be enough, given that map isn't supposed to process everything in memory.

Do you think filing an issue would help?

Can you check ds.cache_files? Since you loaded the dataset from memory using .from_pandas, the dataset has no associated cache directory to save intermediate results.

To fix this you can specify cache_file_name in .map(); this way it will write the results to disk instead of keeping them in memory :wink:
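
For example, something along these lines (the file name here is arbitrary, and writer_batch_size is optional):

ds = ds.map(
    preprocess_function,
    remove_columns="audio",
    cache_file_name="preprocessed.arrow",  # map streams its results into this file instead of RAM
    writer_batch_size=100,
)
print(ds.cache_files)  # should now list preprocessed.arrow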

Wow! That has worked!

When I checked ds.cache_files, it returned an empty list.

Then I tried ds = ds.map(preprocess_function, remove_columns='audio', cache_file_name='test') and it worked with no issues at all. After that, ds.cache_files became [{'filename': 'test'}].

Thanks a lot for your help.

If you don’t mind me asking, how did you get this?

Since you loaded the dataset from memory using .from_pandas, the dataset has no associated cache directory to save intermediate results.

I’ve read the docs for days but was never able to figure this out.

I guess we have to add it to the documentation :stuck_out_tongue:

Basically a Dataset is just a wrapper around an Arrow table. It can be an InMemoryTable, a MemoryMappedTable (i.e. backed by local files), or a combination of both as a ConcatenationTable. You can check the underlying Arrow table with ds.data.
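
For example, a quick way to see the difference (a rough sketch reusing the labels DataFrame from above; labels.csv is just an example path):

from datasets import Dataset, load_dataset

# built directly from a pandas DataFrame -> lives entirely in RAM
ds_in_memory = Dataset.from_pandas(labels)
print(type(ds_in_memory.data))   # InMemoryTable
print(ds_in_memory.cache_files)  # []

# loaded from a file -> converted to Arrow and memory-mapped from the cache
labels.to_csv("labels.csv", index=False)
ds_on_disk = load_dataset("csv", data_files="labels.csv", split="train")
print(type(ds_on_disk.data))     # MemoryMappedTable
print(ds_on_disk.cache_files)    # lists the Arrow file(s) in the cache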

Thanks for taking the time to explain!

Hey! I spent a few days trying to understand this, constantly getting OOM. Setting cache_file_name='test' was also a bit brittle, as it just reuses that cache file regardless of the fingerprint.

It seems like Dataset.from_dict() doesn't have any cache files either, so I had to save to CSV and then load it with the CSV loader (which does have cache functionality):

pd.DataFrame({'id': folders}).to_csv("file.csv", index=False)
ds_ids = datasets.Dataset.from_csv("file.csv")
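
For the audio data set from the top of this thread, the same workaround would look roughly like this (untested sketch; labels.csv is an arbitrary file name):

from datasets import Audio, Dataset

# save the labels DataFrame to disk and load it back, so the dataset
# gets a cache directory that map can write its results to
labels.to_csv("labels.csv", index=False)
ds = Dataset.from_csv("labels.csv")

ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
ds = ds.map(preprocess_function, remove_columns="audio")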

We should change from_dict() and give it its own cache directory tbh; people shouldn't have to hunt down this unexpected source of OOM. I added a note in the docs at More docs to from_dict to mention that the result lives in RAM by lhoestq · Pull Request #7316 · huggingface/datasets · GitHub
