Hi, could somebody please guide me on the correct way to assess the performance of a fine-tuned model such as Whisper? Should I calculate the WER that Whisper gives before fine-tuning on my entire dataset (train+validation+test), and then compare this with the performance on the test set after fine-tuning? I’m asking because the fine-tuned model (Whisper large) still gives a high WER, so I would like to compare the two and check how much fine-tuning is actually improving things. I would appreciate your input!
https://pubs.aip.org/asa/jel/article/4/2/025206/3267247/Evaluating-OpenAI-s-Whisper-ASR-Performance
There seem to be several GitHub repositories and papers on evaluating Whisper this way, using the Hugging Face evaluate library.
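For reference, the WER metric itself comes from the evaluate library (it needs the jiwer package installed) and can be sanity-checked on a pair of plain strings:
import evaluate
wer = evaluate.load("wer")
# one word deleted out of four reference words -> WER = 0.25
print(wer.compute(predictions=["the cat sat"], references=["the cat sat down"]))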
If you use the Hugging Face library, it would look something like this (I think there are some typos because it’s from Hugging Chat…)
To evaluate the performance of a fine-tuned Whisper model, follow these steps to calculate the Word Error Rate (WER) before and after fine-tuning:
Step-by-Step Explanation
1. Install Necessary Libraries
- Install the required libraries for handling datasets, models, and evaluation (jiwer is needed by the WER metric).
pip install datasets evaluate jiwer transformers accelerate torch torchaudio
2. Import Libraries
- Import the necessary Python libraries.
import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import WhisperForConditionalGeneration, WhisperProcessor, TrainingArguments, Trainer
3. Load and Prepare Your Dataset
- Load your custom dataset with training, validation, and test splits. Ensure each example has 'audio' and 'text' fields, and cast the audio to the 16 kHz sampling rate Whisper expects (a sketch for building such a dataset from local files follows this step).
dataset = load_dataset('my_custom_dataset', split='train')
val_dataset = load_dataset('my_custom_dataset', split='validation')
test_dataset = load_dataset('my_custom_dataset', split='test')
dataset = dataset.cast_column('audio', Audio(sampling_rate=16000))
val_dataset = val_dataset.cast_column('audio', Audio(sampling_rate=16000))
test_dataset = test_dataset.cast_column('audio', Audio(sampling_rate=16000))
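If your data lives as local audio files plus transcripts rather than as a dataset on the Hub, a minimal sketch for getting it into this shape (the CSV file names and the audio_path/text column names are just placeholders) could be:
from datasets import load_dataset, Audio
# Hypothetical CSVs with an 'audio_path' column (file paths) and a 'text' column (transcripts)
files = {"train": "train.csv", "validation": "validation.csv", "test": "test.csv"}
raw = load_dataset("csv", data_files=files)
raw = raw.rename_column("audio_path", "audio")
# Casting the path column to Audio decodes each file and resamples it to 16 kHz
raw = raw.cast_column("audio", Audio(sampling_rate=16000))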
4. Define the Whisper Model and Processor
- Initialize the base Whisper model and its processor (the processor bundles the feature extractor and the tokenizer).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to(device)
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
5. Fine-Tune the Model
- Set up training arguments and use the Trainer class to fine-tune the model. The Trainer also needs the examples converted to log-mel input_features and tokenized labels, plus a collator that pads them; both are sketched right after this step.
training_args = TrainingArguments(
    output_dir='whisper-finetuned',
    num_train_epochs=30,
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    gradient_checkpointing=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)
trainer.train()
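The Trainer call above assumes the datasets have already been mapped to model inputs and that data_collator exists. A minimal sketch of those two pieces, loosely following the Hugging Face Whisper fine-tuning guide (the helper names prepare_example and DataCollatorSpeechSeq2SeqWithPadding are illustrative), to be run before trainer.train():
from dataclasses import dataclass
# Convert audio to log-mel input features and text to label ids
def prepare_example(example):
    audio = example['audio']
    example['input_features'] = processor(audio['array'], sampling_rate=audio['sampling_rate']).input_features[0]
    example['labels'] = processor.tokenizer(example['text']).input_ids
    return example
dataset = dataset.map(prepare_example)
val_dataset = val_dataset.map(prepare_example)
# Collator that pads input features and label ids separately within each batch
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: WhisperProcessor
    def __call__(self, features):
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # Replace padding with -100 so it is ignored by the loss
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        return batch
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor)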
6. Calculate WER for Base and Fine-Tuned Models
- Define a function to compute WER using the evaluate library.
def calculate_wer(current_model, dataset):
    wer = evaluate.load("wer")
    references = []
    hypotheses = []
    for example in dataset:
        audio = example['audio']
        # Convert the waveform to log-mel input features
        inputs = processor(audio['array'], sampling_rate=audio['sampling_rate'], return_tensors="pt")
        input_features = inputs.input_features.to(current_model.device)
        # Transcribe
        with torch.no_grad():
            predicted_ids = current_model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        # Get reference
        reference = example['text']
        # Append to lists
        hypotheses.append(transcription)
        references.append(reference)
    # Calculate WER (returned as a fraction, e.g. 0.25 means 25%)
    wer_score = wer.compute(predictions=hypotheses, references=references)
    return wer_score
7. Evaluate the Base Model
- Calculate WER on the test set with the base model. Do this before fine-tuning (or on a freshly loaded copy of "openai/whisper-large"), since trainer.train() updates model in place.
base_wer = calculate_wer(model, test_dataset)
print(f"Base model WER: {100 * base_wer:.2f}%")
8. Evaluate the Fine-Tuned Model
- Save the fine-tuned model (e.g. with trainer.save_model('whisper-finetuned')), reload it, and calculate WER on the same test set.
fine_tuned_model = WhisperForConditionalGeneration.from_pretrained("whisper-finetuned").to(device)
fine_tuned_wer = calculate_wer(fine_tuned_model, test_dataset)
print(f"Fine-tuned model WER: {100 * fine_tuned_wer:.2f}%")
9. Compare the Results
- Compare the two WER scores on the same test set to see how much fine-tuning actually helps; a small snippet for reporting the improvement follows.
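For example, assuming base_wer and fine_tuned_wer from the previous steps (both fractions returned by wer.compute), the comparison could be reported like this:
absolute_reduction = base_wer - fine_tuned_wer
relative_reduction = absolute_reduction / base_wer if base_wer > 0 else 0.0
print(f"WER: {100 * base_wer:.2f}% -> {100 * fine_tuned_wer:.2f}%")
print(f"Absolute reduction: {100 * absolute_reduction:.2f} points")
print(f"Relative reduction: {100 * relative_reduction:.1f}%")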
Final Answer
Here is the complete code to assess the performance of a fine-tuned Whisper model by calculating the WER before and after fine-tuning:
import torch
import evaluate
from dataclasses import dataclass
from datasets import load_dataset, Audio
from transformers import WhisperForConditionalGeneration, WhisperProcessor, TrainingArguments, Trainer
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load datasets (each example needs an 'audio' and a 'text' field)
train_dataset = load_dataset('my_custom_dataset', split='train')
val_dataset = load_dataset('my_custom_dataset', split='validation')
test_dataset = load_dataset('my_custom_dataset', split='test')
# Whisper expects 16 kHz audio
train_dataset = train_dataset.cast_column('audio', Audio(sampling_rate=16000))
val_dataset = val_dataset.cast_column('audio', Audio(sampling_rate=16000))
test_dataset = test_dataset.cast_column('audio', Audio(sampling_rate=16000))
# Initialize model and processor (feature extractor + tokenizer)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to(device)
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
# Function to calculate WER (returned as a fraction, e.g. 0.25 means 25%)
def calculate_wer(current_model, dataset):
    wer = evaluate.load("wer")
    references = []
    hypotheses = []
    for example in dataset:
        audio = example['audio']
        inputs = processor(audio['array'], sampling_rate=audio['sampling_rate'], return_tensors="pt")
        input_features = inputs.input_features.to(current_model.device)
        # Transcribe
        with torch.no_grad():
            predicted_ids = current_model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        # Get reference
        reference = example['text']
        # Append to lists
        hypotheses.append(transcription)
        references.append(reference)
    # Calculate WER
    wer_score = wer.compute(predictions=hypotheses, references=references)
    return wer_score
# Evaluate base model on the test set *before* fine-tuning (trainer.train() updates `model` in place)
base_wer = calculate_wer(model, test_dataset)
print(f"Base model WER: {100 * base_wer:.2f}%")
# Convert audio to log-mel input features and text to label ids
def prepare_example(example):
    audio = example['audio']
    example['input_features'] = processor(audio['array'], sampling_rate=audio['sampling_rate']).input_features[0]
    example['labels'] = processor.tokenizer(example['text']).input_ids
    return example
train_dataset = train_dataset.map(prepare_example)
val_dataset = val_dataset.map(prepare_example)
# Collator that pads input features and label ids separately within each batch
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: WhisperProcessor
    def __call__(self, features):
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # Replace padding with -100 so it is ignored by the loss
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        return batch
# Set up training arguments
training_args = TrainingArguments(
    output_dir='whisper-finetuned',
    num_train_epochs=30,
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    gradient_checkpointing=True,
)
# Initialize and train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=DataCollatorSpeechSeq2SeqWithPadding(processor),
)
trainer.train()
trainer.save_model('whisper-finetuned')
# Evaluate fine-tuned model on the same test set
fine_tuned_model = WhisperForConditionalGeneration.from_pretrained("whisper-finetuned").to(device)
fine_tuned_wer = calculate_wer(fine_tuned_model, test_dataset)
print(f"Fine-tuned model WER: {100 * fine_tuned_wer:.2f}%")
This code will guide you through fine-tuning the Whisper model and evaluating its performance using WER, allowing you to compare the improvement achieved through fine-tuning.
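One more thing worth checking, since the question mentions a surprisingly high WER: Whisper outputs cased, punctuated text, so if the reference transcripts are formatted differently the WER will be inflated. A common mitigation (a sketch using the basic text normalizer that ships with transformers, applied inside calculate_wer just before wer.compute) is to normalize both sides:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer
normalizer = BasicTextNormalizer()  # lowercases and strips punctuation
hypotheses = [normalizer(h) for h in hypotheses]
references = [normalizer(r) for r in references]
wer_score = wer.compute(predictions=hypotheses, references=references)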