Hi, I recently fine-tuned a SpeechEncoderDecoderModel for speech translation under low-resource conditions. The speech encoder is xls-r-300m and the decoder is mbart-large. The model converged, with a smooth loss curve that ends at relatively small loss values. However, when I translate an audio clip, the two ways of getting predictions give very different results. If I use prediction = model(input_values=batch["input_values"], input_ids=batch["labels"]["input_ids"]), the results are fine (BLEU of 9.4). But if I use prediction = model.generate(input_values = batch["input_values"], attention_mask = batch["attention_mask"]), the score drops significantly (BLEU of 1.78).
I read that the generate method decodes autoregressively, feeding each predicted token back in as input for the next step, whereas the forward call is given the reference labels. Is that the reason for such a big drop in performance? Or am I doing something wrong?
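To make the question concrete, here is a toy sketch of the difference as I understand it. It does not use my actual model: NEXT is a made-up lookup table standing in for a decoder, and the token IDs, BOS value and helper names are all hypothetical. The table contains a single mistake (after token 2 it predicts 9 instead of 3), which is enough to show how one error behaves differently in the two modes.

```python
# Toy stand-in for a decoder: maps the previous token to the predicted next token.
# Hypothetical values; the only "mistake" is 2 -> 9 (the reference expects 3).
NEXT = {0: 1, 1: 2, 2: 9, 3: 4, 4: 5, 9: 9}
BOS = 0
reference = [1, 2, 3, 4, 5]


def teacher_forced(ref):
    """Predict each token from the *reference* prefix (like a forward pass with labels)."""
    decoder_inputs = [BOS] + ref[:-1]  # reference shifted right by one position
    return [NEXT[tok] for tok in decoder_inputs]


def greedy_generate(length):
    """Predict each token from the model's *own* previous output (like generate)."""
    output, prev = [], BOS
    for _ in range(length):
        prev = NEXT[prev]  # feed the prediction back in as the next input
        output.append(prev)
    return output


def accuracy(pred, ref):
    """Fraction of positions where the prediction matches the reference."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


tf_pred = teacher_forced(reference)
ar_pred = greedy_generate(len(reference))

print("teacher-forced:", tf_pred, accuracy(tf_pred, reference))
print("autoregressive:", ar_pred, accuracy(ar_pred, reference))
```

In this toy case the teacher-forced pass gets only one position wrong, because the next step is fed the correct reference token again, while the autoregressive loop never recovers after its first mistake. If that is what happens in my setup, the BLEU from the forward call would be an overestimate rather than a fair comparison.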
Thanks!