I was wondering if there is a way to call `generate` on a ViT-to-GPT2 `EncoderDecoderModel`. I managed to figure out how to get the loss by running the ViT and passing its `encoder_outputs` into the model, as shown below. However, `generate` explicitly expects `input_ids`. I'm fairly certain that somewhere under the hood all that's actually needed is `encoder_outputs` (and `input_ids` is unnecessary in that case). Is there a way to do this?
Also, I realise there is a `VisionEncoderDecoderModel`, but I am trying to do this as a learning exercise.
```python
vit2gpt2 = EncoderDecoderModel.from_encoder_decoder_pretrained(VIT_MODEL, DISTIL_GPT2)

# Tokenise the captions and mask out padding positions in the labels
tokenized_captions = gpt2_tokenizer_fn(captions)
labels = tokenized_captions["input_ids"].clone()
labels[tokenized_captions["attention_mask"] == 0] = LABEL_MASK

# Run the ViT encoder once, then feed its outputs into the full model
encoder_outputs = vit2gpt2.encoder(pixel_values=images)
outputs = vit2gpt2(
    encoder_outputs=encoder_outputs,
    decoder_input_ids=tokenized_captions["input_ids"],
    decoder_attention_mask=tokenized_captions["attention_mask"],
    labels=labels,
    return_dict=True,
)
```
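For what it's worth, my understanding is that `generate` skips its internal encoder call whenever `encoder_outputs` is already present in the keyword arguments, so no real `input_ids` should be needed. Here is a sketch of what I mean, using tiny randomly initialised configs (not the actual `VIT_MODEL`/`DISTIL_GPT2` checkpoints, and all sizes are arbitrary) so it runs without downloads — I'm not certain this is supported in every `transformers` version:

```python
import torch
from transformers import (
    EncoderDecoderModel, GPT2Config, GPT2LMHeadModel, ViTConfig, ViTModel,
)

# Tiny random stand-ins for the pretrained ViT and distilGPT2
encoder = ViTModel(ViTConfig(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=64, image_size=32, patch_size=8,
))
decoder = GPT2LMHeadModel(GPT2Config(
    n_embd=32, n_layer=2, n_head=2, vocab_size=100,
    is_decoder=True, add_cross_attention=True,  # required for a decoder
))
vit2gpt2 = EncoderDecoderModel(encoder=encoder, decoder=decoder)

images = torch.randn(2, 3, 32, 32)  # stand-in pixel_values
encoder_outputs = vit2gpt2.encoder(pixel_values=images)

# Pass the precomputed encoder_outputs straight to generate(); the
# encoder is then not re-run and no input_ids are supplied.
generated = vit2gpt2.generate(
    encoder_outputs=encoder_outputs,
    decoder_start_token_id=0,
    pad_token_id=0,
    max_length=8,
)
print(generated.shape)
```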
Here is a Kaggle kernel with a runnable version of the above snippet.
Update 1:
So I've managed to narrow my search down to `generation_utils.py`, but I cannot see where it loops over the predicted values and feeds them back into the model. I'm hoping to replicate the process from there.
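In case it helps anyone following along, the loop I'm trying to replicate boils down to something like the schematic below (an illustrative reimplementation, not the real `transformers` code): start from `decoder_start_token_id`, run the model each step against the fixed `encoder_outputs`, take the argmax of the last position's scores, append it, and stop at EOS or `max_length`. The `step` function here is a hypothetical stand-in for a decoder forward pass:

```python
# Schematic version of the greedy decoding loop in generation_utils.py
# (illustrative only). `step(encoder_outputs, decoder_ids)` stands in for
# a forward pass and returns vocabulary scores for the LAST position.

def greedy_generate(step, encoder_outputs, start_id, eos_id, max_length):
    decoder_ids = [start_id]                    # decoder_input_ids
    while len(decoder_ids) < max_length:
        scores = step(encoder_outputs, decoder_ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        decoder_ids.append(next_id)             # feed the prediction back in
        if next_id == eos_id:                   # stop on end-of-sequence
            break
    return decoder_ids

# Toy "model" over a vocab of 5: always scores (last id + 1) % 5 highest.
toy = lambda enc, ids: [1.0 if v == (ids[-1] + 1) % 5 else 0.0 for v in range(5)]
print(greedy_generate(toy, encoder_outputs=None, start_id=0, eos_id=4, max_length=10))
# -> [0, 1, 2, 3, 4]
```

Note that `encoder_outputs` is computed once and reused every iteration; only the decoder ids grow.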