Hi, I recently finetuned a SpeechEncoderDecoderModel for speech translation under low-resource conditions. The speech encoder is xls-r-300m and the decoder is mbart-large.
The model converged and showed a nice loss curve that ends at relatively small loss values. However, I noticed a discrepancy when I try to translate an audio clip. If I use prediction = model(input_values=batch["input_values"], input_ids=batch["labels"]["input_ids"]), the results are fine (BLEU of 9.4). But if I use prediction = model.generate(input_values=batch["input_values"], attention_mask=batch["attention_mask"]), the score drops significantly (BLEU of 1.78).
I read that the generate method decodes autoregressively. Is that the reason for such a big drop in performance, or am I doing something wrong?
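To make sure I understand the difference: my mental model is that the forward call above conditions each decoder step on the gold labels (teacher forcing), while generate feeds the model's own previous predictions back in, so a single mistake can derail everything after it. Here is a toy sketch of that intuition (nothing to do with the actual Hugging Face internals; the tokens and the lookup-table "model" are made up):

```python
# Toy illustration of teacher forcing vs. autoregressive decoding.
# The "model" predicts the next token from the previous one via a lookup
# table, with one deliberate error to show how mistakes compound.

NEXT = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "down",
        "down": "</s>"}

def predict(prev_token):
    # Deliberate model error: after "the" it predicts "dog", not "cat".
    if prev_token == "the":
        return "dog"
    return NEXT.get(prev_token, "</s>")

reference = ["the", "cat", "sat", "down", "</s>"]

# Teacher forcing (like model(..., input_ids=labels)): each step is
# conditioned on the *gold* previous token, so the error stays local.
teacher_forced = [predict(prev) for prev in ["<s>"] + reference[:-1]]

# Autoregressive decoding (like model.generate(...)): each step is
# conditioned on the model's *own* previous output, so the first
# mistake propagates to every later step.
autoregressive, tok = [], "<s>"
for _ in range(len(reference)):
    tok = predict(tok)
    autoregressive.append(tok)

print(teacher_forced)   # ['the', 'dog', 'sat', 'down', '</s>']  one error
print(autoregressive)   # ['the', 'dog', '</s>', '</s>', '</s>'] cascade
```

If that intuition is right, the teacher-forced BLEU of 9.4 is an optimistic upper bound, and the generate score reflects the model's real inference-time behavior.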
Thanks!