Hi, first, thank you for open-sourcing such a good ASR project. I have recently been investigating Whisper in my research, and I applied LoRA for parameter-efficient fine-tuning on my dataset (a 30 h Mandarin Chinese speech corpus). Before fine-tuning, Whisper achieves about 10% WER. After fine-tuning, however, decoding seems to go wrong: the model repeatedly emits the same tokens over and over.
It looks like this:
Below are some of my configuration snippets: batch_size=2, num_train_epochs=3, fp16=True.
from peft import LoraConfig
from transformers import Seq2SeqTrainingArguments

# lora config
config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")
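# apply the adapter to the base model (sketch for context; the checkpoint
# name below is my placeholder, substitute whichever Whisper variant you use)
from transformers import WhisperForConditionalGeneration
from peft import get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the LoRA weights should be trainable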
# training_args
training_args = Seq2SeqTrainingArguments(
    output_dir=args.output_dir,  # change to a repo name of your choice
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=16 // batch_size,  # increase by 2x for every 2x decrease in batch size
    gradient_checkpointing=args.gradient_checkpoint,
    learning_rate=1e-3,
    warmup_steps=50,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    fp16=fp16,
    per_device_eval_batch_size=16,
    eval_accumulation_steps=1,  # otherwise predictions accumulate on the GPU and can OOM
    generation_max_length=128,
    logging_steps=25,
    remove_unused_columns=False,  # required as the PeftModel forward doesn't have the signature of the wrapped model's forward
    label_names=["labels"],  # same reason as above
    report_to=["tensorboard"],
)
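For context, this is roughly how I launch training and decode at evaluation time (a minimal sketch: `train_ds`, `eval_ds`, `data_collator`, `processor`, and `batch` are placeholders for my actual objects, and the generation kwargs follow the transformers Whisper API):

from transformers import Seq2SeqTrainer
import torch

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
)
trainer.train()

# decoding a sample after fine-tuning; this is where the repeated tokens show up
model.eval()
with torch.no_grad():
    pred_ids = model.generate(
        input_features=batch["input_features"],
        max_new_tokens=128,
        language="zh",  # force Mandarin transcription
        task="transcribe",
    )
text = processor.batch_decode(pred_ids, skip_special_tokens=True)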
It would be much appreciated if anyone has any idea about this issue, and please let me know if you need any more info/clues. Thanks!