Hi, I have an audio dataset in the following format: 16 kHz audio files in a folder named "audio", plus a pandas DataFrame mapping each audio file to its label.
(The code to create this dataset is at the end of this post.)
Question:
What is the standard way to build a Dataset from this data to train an audio classification model? More specifically, how can I use the facebook/hubert-large-ls960-ft feature extractor to create a Dataset for training a HuBERT model? I also need to truncate/pad each input to 10 seconds, which I've done in the preprocess_function below.
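To make the 10-second requirement concrete, here is a plain-NumPy sketch of the truncate/pad behavior I'm after (pad_or_truncate is just an illustrative helper of mine, not a transformers API):

```python
import numpy as np

TARGET_LEN = 16_000 * 10  # 10 s at 16 kHz

def pad_or_truncate(audio: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Cut long clips down to target_len and zero-pad short ones up to it."""
    if len(audio) >= target_len:
        return audio[:target_len]
    return np.pad(audio, (0, target_len - len(audio)))

short = np.random.randn(80_000)    # 5 s clip
long_ = np.random.randn(240_000)   # 15 s clip
print(len(pad_or_truncate(short)), len(pad_or_truncate(long_)))  # 160000 160000
```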
What I tried:
import numpy as np
import os
import pandas as pd
import soundfile as sf
from datasets import Dataset, Audio
from transformers import Wav2Vec2FeatureExtractor

# create the dataset from the pandas DataFrame
ds = Dataset.from_pandas(labels)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# feature extractor (Wav2Vec2FeatureExtractor, since only audio features are needed, not the CTC tokenizer)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-large-ls960-ft")
def preprocess_function(examples):
    audio_arrays = [examples['audio']['array']]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16_000,
        max_length=int(16_000 * 10),  # 10 s
        truncation=True,
        padding="max_length",  # pad short clips up to 10 s as well
    )
    return inputs
# map the preprocessing function
ds = ds.map(preprocess_function, remove_columns='audio')
This works fine when the dataset is small, but it fails when there are many audio files (N ≈ 10,000): the map operation exhausts memory. I'm probably doing something wrong, because this clearly does not align with "The magic of memory mapping". What am I doing wrong? Thanks!
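For scale, here is my back-of-envelope estimate of the decoded feature size (assuming float32 input_values padded to 10 s; these numbers are estimated, not measured):

```python
n_files = 10_000
samples_per_clip = 16_000 * 10   # 10 s at 16 kHz
bytes_per_sample = 4             # float32
total_gb = n_files * samples_per_clip * bytes_per_sample / 1e9
print(total_gb)  # 6.4
```

So the processed arrays alone are about 6.4 GB, and if map materializes them as Python float lists before writing to the Arrow cache, the peak in-memory footprint could be several times larger still.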
Code to create the dataset:
# number of examples
N = 10
# labels file
labels = pd.DataFrame({
    'audio': [os.path.join('audio_dir', f"{i}.wav") for i in range(N)],
    'label': np.random.choice(['A', 'B'], N),
})
# save dummy audio files
os.makedirs("audio_dir", exist_ok=True)
for file_path in labels['audio']:
    dummy_audio = np.random.randn(np.random.randint(80_000, 240_000))  # between 5 s and 15 s long
    sf.write(file_path, dummy_audio, 16_000)
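As a quick sanity check that the dummy clips really span 5–15 s (plain arithmetic, no file I/O; the 1,000-draw sample is just illustrative):

```python
import numpy as np

sr = 16_000
# clip lengths are drawn from [80_000, 240_000) samples, as in the generator above
lengths = np.random.choice(np.arange(80_000, 240_000), size=1_000)
durations = lengths / sr
print(durations.min() >= 5.0, durations.max() < 15.0)  # True True
```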