I’m trying to fine-tune the Whisper model on a custom dataset, but I’m getting a memory error:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 779. GiB for an array with shape (480000, 435681) and data type float32
I’m using the fine-tune-whisper-non-streaming code and the error is thrown at the following line of code:
batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
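As a rough check, the size in that error can be reproduced with simple arithmetic. This is only a sketch: it assumes the first dimension, 480000, is Whisper's 30-second input window at 16 kHz (30 × 16000), and that 435681 is the number of samples in a single clip, which is not confirmed. The helper name `array_gib` is hypothetical.

```python
# Reproduce the allocation size reported in the error message.
# Assumption: 480000 = 30 s * 16000 Hz (Whisper's padded input length).

BYTES_PER_FLOAT32 = 4
GIB = 1024 ** 3


def array_gib(shape, bytes_per_item=BYTES_PER_FLOAT32):
    """Return the size in GiB of a dense array with the given shape."""
    n_items = 1
    for dim in shape:
        n_items *= dim
    return n_items * bytes_per_item / GIB


whisper_window = 30 * 16000              # 480000 samples
requested = array_gib((whisper_window, 435681))
expected = array_gib((whisper_window,))  # one padded 1-D clip

print(f"requested: {requested:.1f} GiB")
print(f"one padded clip: {expected * 1024:.2f} MiB")
```

The result suggests that a 2-D array is being built where a single padded 1-D clip of under 2 MiB would be expected, so the shape of the input passed to the feature extractor looks like the first thing to check.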
I’m concerned that something is wrong with the audio["array"] part of my custom dataset. To reuse as much of the code as possible, I changed the format of my custom dataset to match that of the Common Voice dataset that the code uses. My dataset starts as a CSV in the format “wav file, transcription”:
/home/username/location/of/wavfileOne.wav,transcription of the first utterance
/home/username/location/of/wavfileTwo.wav,transcription of the second utterance
I use the following code to go from my CSV to the DataSet Dict:
from csv import reader

import librosa
from datasets import Dataset

test_sentence = []
test_audio = []
with open('custom_dataset.csv', 'r') as read_obj:
    csv_reader = reader(read_obj)
    for row in csv_reader:
        path = row[0]
        # sr=None keeps each file's native sampling rate (no resampling)
        speech_array, sampling_rate = librosa.load(path, sr=None)
        audio_stuff = {'path': path, 'array': speech_array, 'sampling_rate': sampling_rate}
        test_audio.append(audio_stuff)
        test_sentence.append(row[1])
dataset_dict = {'audio': test_audio, 'sentence': test_sentence}
custom_dataset = Dataset.from_dict(dataset_dict)
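One alternative way to build the same dataset, sketched here as an assumption rather than a confirmed fix, is to store only the file paths and let the `datasets` `Audio` feature decode them, which is the feature type Common Voice uses for its `audio` column. The helper names `read_rows` and `build_dataset` are hypothetical, and the 16 kHz target assumes the Whisper feature extractor is used with its default 16 kHz sampling rate.

```python
import csv


def read_rows(csv_path):
    """Read (wav_path, transcription) pairs from a two-column CSV."""
    paths, sentences = [], []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            paths.append(row[0])
            sentences.append(row[1])
    return paths, sentences


def build_dataset(csv_path, sampling_rate=16000):
    """Build a Dataset whose 'audio' column is decoded from the file paths."""
    # Imported here so read_rows stays usable without `datasets` installed.
    from datasets import Audio, Dataset

    paths, sentences = read_rows(csv_path)
    dataset = Dataset.from_dict({"audio": paths, "sentence": sentences})
    # Audio() decodes each file on access and resamples to `sampling_rate`.
    return dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
```

With this layout, `build_dataset('custom_dataset.csv')[0]["audio"]` is expected to expose `array` and `sampling_rate` in the same way a Common Voice sample does, without the arrays having been stored in the dataset by hand.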
The custom_dataset[audio][array] is float32, just like the Common Voice array. Again, my biggest concern is whether I’m creating the array data correctly, i.e. speech_array, sampling_rate = librosa.load(path, sr=None). I’m using a subset (5 samples) of my dataset for testing. The longest wav file in the subset is 12 seconds long. I’m able to use the same dataset to fine-tune from the wav2vec XLSR pretrained models without any issues.
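A quick check that may help narrow this down is comparing sample counts against clip length. The sketch below assumes that 435681, the second dimension in the error, is the sample count of one clip; the candidate sampling rates and the helper name `duration_seconds` are illustrative assumptions.

```python
def duration_seconds(n_samples, sampling_rate):
    """Length of a clip in seconds, given its sample count and rate."""
    return n_samples / sampling_rate


N_SAMPLES = 435681          # second dimension in the error (assumed: one clip)
LONGEST_CLIP_SECONDS = 12   # longest wav file in the 5-sample subset

for rate in (16000, 22050, 44100, 48000):
    seconds = duration_seconds(N_SAMPLES, rate)
    fits = seconds <= LONGEST_CLIP_SECONDS
    print(f"{rate:>6} Hz -> {seconds:5.2f} s  fits within 12 s: {fits}")
```

Under that assumption, the count only fits a clip of 12 seconds or less at rates above roughly 36 kHz, which would be consistent with `sr=None` keeping a native rate higher than 16 kHz. Passing `sr=16000` to `librosa.load` would resample on load instead.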