Hey @navissivan
- Sorry, I thought about it and we don't need `group_by_length`: grouping by length sorts together samples of roughly the same length in the training dataset (to minimize the padding applied and be more efficient). But since all of our samples are padded/truncated to 30s by the Whisper feature extractor, the padding is the same for all samples. Long story short, set `group_by_length=False` (see the sketch after this list). This will mean training starts immediately! I've updated the template Colab to reflect this.
- Oh, that's strange - is the Trainer definitely performing evaluation? Do you see the `***** Running Evaluation *****` message pop up in the logs? The progress bar will definitely show if you run it as a Python script - it might be something to do with the notebook environment. Could you also check the `README.md` in your output directory? The table should have saved there as well.
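For reference, a minimal sketch of where that flag lives, assuming the usual `Seq2SeqTrainingArguments` setup from the fine-tuning Colab (the other argument values here are illustrative, not prescriptive):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",  # illustrative path
    per_device_train_batch_size=16,    # illustrative value
    group_by_length=False,  # samples are already padded to 30s, so length grouping buys nothing
)
```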
You set the Trainer as follows with your train and validation sets:

```python
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=fleurs_ch["train"],
    eval_dataset=fleurs_ch["validation"],  # validation set
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
```
And then after training, you run an extra "prediction" step on your test set:

```python
predict_results = trainer.predict(fleurs_ch["test"], metric_key_prefix="test")
metrics = predict_results.metrics
trainer.log_metrics("test", metrics)
trainer.save_metrics("test", metrics)
```
New q's:
- I didn't include a normaliser to keep the example streamlined. You can certainly include one if you don't care about casing or punctuation in your transcriptions. Yep, this is the way to do it, but make sure you apply the normaliser to both your label string and your prediction string:

```python
# normalise predictions and labels before computing the metric
pred_str = custom_normalizer(pred_str, "zh")
label_str = custom_normalizer(label_str, "zh")
```
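In context, that looks something like the following inside `compute_metrics` - a minimal sketch, assuming `custom_normalizer` is your own helper from above and that you're scoring with WER via the `evaluate` library (you may prefer CER for Chinese):

```python
import evaluate

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # replace -100 (ignored in the loss) with the pad token so labels decode cleanly
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # normalise both sides before scoring
    pred_str = [custom_normalizer(s, "zh") for s in pred_str]
    label_str = [custom_normalizer(s, "zh") for s in label_str]

    wer = 100 * wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
```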
> And here the `tokenizer.batch_decode` is the same as `processor.batch_decode`, right?

Correct!
> How do I continue training from a checkpoint?

This has already been asked before.
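For quick reference, the Trainer supports this directly through the `resume_from_checkpoint` argument to `train` - a minimal sketch (the checkpoint path is illustrative):

```python
# resume from the most recent checkpoint in output_dir
trainer.train(resume_from_checkpoint=True)

# or point at a specific checkpoint directory (illustrative path)
trainer.train(resume_from_checkpoint="./whisper-finetuned/checkpoint-1000")
```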