I am currently trying to set up your VisionEncoderDecoderModel with a pretrained BERT model as the decoder, but I am struggling with the model.generate (greedy search) part.

I have put everything I have into a Colab Notebook (otherwise it would be too much to post directly):
Colab Notebook (note: it is open if anyone is interested in collaborating)

What is the difference between decoder_input_ids and labels? (I think decoder_input_ids are the tokenized labels used for training, and labels are the full vocabulary ids, right?) (I ask because for the EncoderDecoder model the tokenized labels seem to be used for both decoder_input_ids and labels.)
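To illustrate my current understanding: I assume decoder_input_ids are just the labels shifted one position to the right, starting with the decoder start token. A minimal sketch in plain Python (the token ids and decoder_start_token_id value are made up for illustration):

```python
# Hypothetical tokenized target sequence (ids are made up for illustration),
# e.g. "[CLS] this is a test [SEP]".
labels = [101, 2023, 2003, 1037, 3231, 102]

decoder_start_token_id = 0  # assumption: taken from the model config

# Shift right: decoder_input_ids are the labels moved one position to the
# right, with the start token prepended and the last label dropped.
decoder_input_ids = [decoder_start_token_id] + labels[:-1]

print(decoder_input_ids)  # [0, 101, 2023, 2003, 1037, 3231]
```

Is this roughly what happens internally, so passing only labels would be enough?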

I would like to track the CER/WER metrics during the validation step. Is there a way to do this without model.generate?
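For reference, this is roughly how I compute CER on the decoded strings right now: a plain edit-distance sketch in pure Python (the helper name is my own; in practice one would compare the decoded generate output against the reference transcriptions):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    r, h = list(reference), list(hypothesis)
    # Dynamic-programming edit-distance table of size (len(r)+1) x (len(h)+1).
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(r)][len(h)] / max(len(r), 1)

print(cer("hello", "hallo"))  # 0.2 (one substitution out of five characters)
```

The problem is just that getting the hypothesis strings seems to require model.generate in the first place.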

Do you see a way to export this model to ONNX format after training? I think for this to work I would need to implement the greedy search by hand, or is there already a solution for this in the transformers library?
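In case I do have to implement it by hand, this is the kind of greedy loop I have in mind: a toy sketch where a dummy next-token scorer stands in for a call into the exported ONNX decoder session (the scorer, token ids, EOS_ID, and MAX_LEN are all made up for illustration):

```python
EOS_ID = 0   # assumption: id of the end-of-sequence token
MAX_LEN = 8  # assumption: generation length cap

def next_token_logits(token_ids):
    """Dummy stand-in for one ONNX decoder forward pass.

    For illustration it 'predicts' increasing ids 2, 3, 4 and then EOS.
    """
    last = token_ids[-1]
    logits = [0.0] * 6
    next_id = last + 1 if last < 4 else EOS_ID
    logits[next_id] = 1.0
    return logits

def greedy_decode(start_id):
    """Hand-rolled greedy search: argmax one token at a time until EOS."""
    ids = [start_id]
    for _ in range(MAX_LEN):
        logits = next_token_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == EOS_ID:
            break
    return ids

print(greedy_decode(1))  # [1, 2, 3, 4, 0]
```

Would a loop like this around an exported encoder + decoder session be the expected approach, or is there something built in?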

Thanks a lot :hugs: