I am following the blog post below to fine-tune a pretrained model on my custom dataset: Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers. In the blog, the author explains what each preprocessing step does. Here is the relevant excerpt.
> First, we load and resample the audio data, simply by calling `batch["audio"]`. Second, we extract the `input_values` from the loaded audio file. In our case, the `Wav2Vec2Processor` only normalizes the data. For other speech models, however, this step can include more complex feature extraction, such as [Log-Mel feature extraction](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum). Third, we encode the transcriptions to label ids.
Below is the section where, as far as I know, the change needs to be made:
```python
def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched"
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values
    batch["input_length"] = len(batch["input_values"])

    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch
```
The change I made is to this line:

```python
batch["input_values"] = librosa.feature.mfcc(audio["array"], n_mfcc=13, sr=audio["sampling_rate"])
```
However, this doesn’t seem to work. Can somebody help me out? Thanks.