Generating text with an EncoderDecoderModel that uses a ViT encoder

I was wondering if there is a way to call generate on a ViT-to-GPT2 EncoderDecoderModel. I managed to figure out how to get the loss by running the ViT encoder on its own and passing its encoder_outputs into the model, as shown below. However, generate seems to explicitly expect input_ids. I’m fairly certain that somewhere under the hood all it really needs is the encoder_outputs (and input_ids is unnecessary in that case). Is there a way to do this?

Also, I realise there is a VisionEncoderDecoderModel, but I am trying to do this as a learning exercise.

    from transformers import EncoderDecoderModel

    vit2gpt2 = EncoderDecoderModel.from_encoder_decoder_pretrained(VIT_MODEL, DISTIL_GPT2)

    # Tokenize the captions and mask out padding positions in the labels
    tokenized_captions = gpt2_tokenizer_fn(captions)
    labels = tokenized_captions["input_ids"].clone()
    labels[tokenized_captions["attention_mask"] == 0] = LABEL_MASK

    # Run the ViT encoder separately, then pass its outputs into the full model
    encoder_outputs = vit2gpt2.encoder(pixel_values=images)
    outputs = vit2gpt2(
        encoder_outputs=encoder_outputs,
        decoder_input_ids=tokenized_captions["input_ids"],
        decoder_attention_mask=tokenized_captions["attention_mask"],
        labels=labels,
        return_dict=True,
    )
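For reference, this is the kind of call I’d ideally like to be able to make (hypothetical, not working for me right now since generate seems to insist on input_ids; gpt2_tokenizer and MAX_LENGTH are placeholders from my setup):

    # Hypothetical: what I'd like generate() to accept, with no input_ids at all
    generated_ids = vit2gpt2.generate(
        encoder_outputs=encoder_outputs,
        decoder_start_token_id=gpt2_tokenizer.bos_token_id,
        max_length=MAX_LENGTH,
    )
    captions_out = gpt2_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)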

Here is a Kaggle kernel with a runnable version of the above snippet.

Update 1:

So I’ve managed to narrow my search down to generation_utils.py, but I cannot see where it loops over the predicted tokens and feeds them back into the model. I’m hoping to replicate that process myself; a rough sketch of what I’m aiming for is below.
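To make it concrete, here is a minimal, untested greedy-decoding sketch that reuses the encoder_outputs trick from above (again, gpt2_tokenizer and MAX_LENGTH are placeholders from my setup):

    import torch

    # Encode the image once, then repeatedly feed the growing decoder sequence back in,
    # appending the argmax token each step until EOS or a length limit is hit.
    with torch.no_grad():
        encoder_outputs = vit2gpt2.encoder(pixel_values=images[:1])  # a single image
        decoder_input_ids = torch.tensor([[gpt2_tokenizer.bos_token_id]])

        for _ in range(MAX_LENGTH):
            outputs = vit2gpt2(
                encoder_outputs=encoder_outputs,
                decoder_input_ids=decoder_input_ids,
                return_dict=True,
            )
            # Take the most likely next token and append it to the decoder input
            next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
            if next_token.item() == gpt2_tokenizer.eos_token_id:
                break

    caption = gpt2_tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)

If generate can already consume encoder_outputs directly, I’d much rather use that, since it would also give me things like beam search for free.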