Trainer freezes/crashes after evaluation step

Hi, I was following the tutorial on fine-tuning the Whisper model when I ran into this issue. Everything works fine until I call trainer.train(). Training proceeds normally until it reaches the evaluation step we set in the training arguments. After that, it doesn’t progress any further; there are no error messages, it’s just stuck at the step right after the evaluation step.

Here is my code relevant to the trainer and model.

from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor

# Load feature extractor to process the raw audio inputs.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# Load Whisper tokenizer
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="English", task="transcribe")

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="English", task="transcribe")
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        
        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
    
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    
    return {"wer": wer}
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.generation_config.language = "en"
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-eng-gen",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=2000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    ignore_data_skip=True,
    do_eval=True
)
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
trainer.train()

And this is the output

My data is an IterableDatasetDict, as Common Voice’s English dataset is too big to download entirely onto disk. Hence, I’m using the streaming option when loading the dataset.
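For reference, the dataset is loaded in streaming mode roughly like this (the dataset name and config below follow the tutorial, so the exact values may differ from my actual code):

from datasets import load_dataset, IterableDatasetDict, Audio

dataset = IterableDatasetDict()
dataset["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train", streaming=True)
dataset["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True)

# Whisper's feature extractor expects 16 kHz audio, so resample on the fly.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))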

Help is very much appreciated! Thanks!

I enabled verbose logging with

import transformers

transformers.logging.set_verbosity_info()

And this is what it tells me.

You have passed language=en, but also have set forced_decoder_ids to [[1, None], [2, 50359]] which creates a conflict. forced_decoder_ids will be ignored in favor of language=en.

Any idea on how to fix this?

Hey RitchieP,

I’m facing the same issue and haven’t found a solution yet… it just freezes when it starts to evaluate… Do you have any updates on this, or have you fixed it?

Best,

Hi @MikkelWK ,

I eventually found out that it is not freezing or crashing. Evaluation is just much slower, probably because it goes through the examples one by one. Give it some time and it will continue.
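If the evaluation pass is too slow for you, one workaround (just a suggestion on my end, not something from the tutorial) is to evaluate on only a subset of the streamed test split with IterableDataset.take():

# Evaluate on only the first N streamed examples; 500 is an arbitrary number.
eval_subset = dataset["test"].take(500)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=eval_subset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)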

Regarding the message

You have passed language=en, but also have set forced_decoder_ids to [[1, None], [2, 50359]] which creates a conflict. forced_decoder_ids will be ignored in favor of language=en.

It is actually expected behavior. Every time the evaluation loop moves on to another example, it prints that message again, which is part of why it looks like it froze/crashed.
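If you want to silence the repeated message itself, one thing that should work (an assumption on my part, not something from the tutorial) is to also clear the leftover forced_decoder_ids on the generation config, since the language/task settings already cover them:

# Clearing the stale forced_decoder_ids on the generation config should stop the
# conflict warning; the language/task set on the generation config already
# determine the forced decoder prompt.
model.generation_config.forced_decoder_ids = None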


Hi Ritchie,

Thanks for the reply. I am running on an A100 GPU and it takes approximately 5–6 hours to evaluate… I will not evaluate often then, haha.

Could it be something with the IterableDatasetDict? Have you tried just downloading it all?

Best

I’ve tried downloading everything with a smaller dataset, but it’s still the same. It took my model roughly 11 hours to train with 5 evaluation steps.
