How can I use MFCC feature extraction while fine-tuning a pretrained model?

I am following the blog post Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers to fine-tune the pretrained model on my custom dataset. In the blog, the author mentions where feature extraction happens in the pipeline. Below is the relevant excerpt.

> First, we load and resample the audio data, simply by calling `batch["audio"]`. Second, we extract the `input_values` from the loaded audio file. In our case, the Wav2Vec2Processor only normalizes the data. For other speech models, however, this step can include more complex feature extraction, such as [Log-Mel feature extraction](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum). Third, we encode the transcriptions to label ids.
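For reference, here is a minimal sketch of what such a log-Mel extraction step could look like with librosa (the file name, sampling rate, and `n_mels` value here are placeholder assumptions, not values from the blog):

```python
import librosa

# Hypothetical example: load and resample a clip to 16 kHz
speech, sr = librosa.load("sample.wav", sr=16_000)  # "sample.wav" is a placeholder

# Log-Mel spectrogram: a 2-D array of shape (n_mels, n_frames)
mel = librosa.feature.melspectrogram(y=speech, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)
```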

Below is the section where, AFAIK, the change needs to be made:

```python
def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched"
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    # encode the transcription to label ids
    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch
```
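For context, in the blog this function is then applied over the whole dataset with 🤗 Datasets' `map` (the `common_voice_train` name follows the Common Voice setup used in the post):

```python
common_voice_train = common_voice_train.map(
    prepare_dataset, remove_columns=common_voice_train.column_names
)
```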

The change I made is to this line:

```python
batch["input_values"] = librosa.feature.mfcc(
    y=audio["array"], sr=audio["sampling_rate"], n_mfcc=13
)
```

However, this doesn’t seem to work. Can somebody help me out? Thanks.

I have exactly the same problem. Can somebody help me, please?