I was wondering if there is a way to call `generate` on a ViT-to-GPT2 `EncoderDecoderModel`. I managed to figure out how to get the loss by taking the outputs of the ViT and pushing them into the model as `encoder_outputs`, as shown below. However, it seems that `generate` explicitly expects `input_ids`. I'm fairly certain that somewhere under the hood all you really need is the `encoder_outputs` (and `input_ids` is unnecessary in that case). Is there a way to do this?
Also, I realise that there is a `VisionEncoderDecoderModel`, but I am trying to do this as a learning exercise.
```python
vit2gpt2 = EncoderDecoderModel.from_encoder_decoder_pretrained(VIT_MODEL, DISTIL_GPT2)

tokenized_captions = gpt2_tokenizer_fn(captions)
labels = tokenized_captions["input_ids"].clone()
labels[tokenized_captions["attention_mask"] == 0] = LABEL_MASK

encoder_outputs = vit2gpt2.encoder(pixel_values=images)
outputs = vit2gpt2(
    encoder_outputs=encoder_outputs,
    decoder_input_ids=tokenized_captions["input_ids"],
    decoder_attention_mask=tokenized_captions["attention_mask"],
    labels=labels,
    return_dict=True,
)
```
Here is a Kaggle kernel with a runnable version of the above snippet.
So I've managed to narrow my search down to `generation_utils.py`, but I cannot see where in there it loops over the predicted tokens and feeds them back into the model. I'm hoping to replicate the process from there.
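From reading around, I believe the core of that loop, once beam search, caching, and sampling are stripped away, is just greedy feedback: take the argmax token, append it to the decoder input, and run the decoder again. Here is a minimal self-contained sketch of what I think is happening; `toy_decoder` is a hypothetical stand-in for the real decoder forward pass (I'm using a dummy rule instead of an actual model so the snippet runs on its own), and the token ids and vocabulary size are made up for illustration:

```python
def toy_decoder(decoder_input_ids, encoder_outputs):
    """Dummy stand-in for the decoder forward pass: returns next-token
    logits over a vocabulary of 10, favouring (last_token + 1) % 10."""
    vocab_size = 10
    last = decoder_input_ids[-1]
    logits = [0.0] * vocab_size
    logits[(last + 1) % vocab_size] = 1.0
    return logits

def greedy_generate(decoder, encoder_outputs, bos_token_id, eos_token_id, max_length):
    # Start decoding from the BOS token only; the encoder outputs are
    # computed once and stay fixed for the whole loop.
    generated = [bos_token_id]
    while len(generated) < max_length:
        logits = decoder(generated, encoder_outputs)
        # Greedy step: pick the argmax token...
        next_token = max(range(len(logits)), key=lambda i: logits[i])
        # ...and feed the prediction back in as decoder input.
        generated.append(next_token)
        if next_token == eos_token_id:
            break
    return generated

print(greedy_generate(toy_decoder, encoder_outputs=None,
                      bos_token_id=0, eos_token_id=5, max_length=20))
```

With the dummy rule above this prints `[0, 1, 2, 3, 4, 5]`, stopping at the (made-up) EOS id. My understanding is that the real `generate` does this same feedback step in batched tensor form, which is why I think precomputed `encoder_outputs` ought to be enough to drive it.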