Hi, I’m running out of memory when trying to use the map function on a Dataset. A simplified, (mostly) reproducible example (on a machine with 16 GB of RAM) is below.
Let’s say I have a dataset of 1000 audio files of varying lengths from 5 seconds to 20 seconds, all sampled at 16 kHz. Assume I have the following Dataset object to represent that:
import numpy as np
from datasets import Dataset

dataset_size = 1000
dummy_data = {
    "audio_file": [
        {
            'path': "ignore-me.wav",
            'array': np.random.random(
                np.random.choice(np.arange(5 * 16_000, 20 * 16_000), 1)[0].astype(int)
                # an audio file of a length between 5s and 20s
            ).astype(np.float32),
            'sampling_rate': 16000
        } for _ in range(dataset_size)
    ],
    "labels": [np.random.choice(['A', 'B']) for _ in range(dataset_size)],
}
ds = Dataset.from_dict(dummy_data)
del dummy_data
I want to use the "facebook/hubert-large-ls960-ft" feature extractor to preprocess this data set to train a Hubert model. This is my preprocessing code:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-large-ls960-ft")

def preprocess_function(examples):
    # Collect the raw waveforms of the current batch
    audio_arrays = [x['array'] for x in examples['audio_file']]
    # Truncate to at most 160,000 samples (10 s at 16 kHz) and pad the batch
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=160_000,
        truncation=True,
        padding=True
    )
    return inputs
When I apply the following, my machine with 16 GB of RAM runs out of memory and the notebook crashes. htop suggests that the process is exhausting all the memory.
ds.map(
    preprocess_function,
    remove_columns=['audio_file'],
    batched=True)
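As a rough sanity check on the numbers involved, the sketch below estimates the size of one padded batch. It assumes the default batch_size of 1000 for a batched map call, that at least one clip in the batch reaches the 160,000-sample max_length (so padding=True pads every clip to that length), and that values are stored as float32. It counts only the padded waveforms as a single NumPy-style buffer; any intermediate copies or conversions to Python objects would come on top of this and are not estimated here.

```python
# Back-of-the-envelope size of one padded batch of waveforms.
batch_size = 1000        # assumed: default batch_size of Dataset.map with batched=True
max_length = 160_000     # samples per clip after truncation/padding (10 s at 16 kHz)
bytes_per_sample = 4     # float32

# Every clip in the batch is padded to max_length, so the batch is a dense block.
input_values_bytes = batch_size * max_length * bytes_per_sample
print(f"{input_values_bytes / 1e6:.0f} MB per batch")  # prints "640 MB per batch"
```

With dataset_size = 1000, the whole dummy data set falls into a single batch under these assumptions.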
Question:
I believe that this behaviour does not align with "The magic of memory mapping". Most likely, I’m doing something wrong here. How can I fix it?
PS:
- In my actual data set, there are audio clips from 5s to several minutes long and I have about 15,000 total audio clips. I believe it’s quite a small data set, considering the size of other speech data sets available.
- I obtain the variable ds above on the actual data set by using the from_pandas method and applying the datasets.Audio function to load as an audio clip.
- Versions: transformers=4.18.0, datasets=2.1.0
Thanks!