Generating text with an EncoderDecoderModel that uses a ViT encoder

I was wondering if there is a way to call generate on a ViT-to-GPT2 EncoderDecoderModel. I managed to figure out how to get the loss by running the ViT encoder on its own and passing its encoder_outputs into the model, as shown below. However, generate seems to explicitly expect input_ids. I’m fairly certain that somewhere under the hood all it really needs is the encoder_outputs (and input_ids is unnecessary in that case). Is there a way to do this?

Also, I realise there is a VisionEncoderDecoderModel, but I am trying to do this as a learning exercise.

    from transformers import EncoderDecoderModel

    vit2gpt2 = EncoderDecoderModel.from_encoder_decoder_pretrained(VIT_MODEL, DISTIL_GPT2)

    # Tokenize the captions and mask out padding positions in the labels
    tokenized_captions = gpt2_tokenizer_fn(captions)
    labels = tokenized_captions["input_ids"].clone()
    labels[tokenized_captions["attention_mask"] == 0] = LABEL_MASK

    # Run the ViT encoder separately, then pass its outputs into the full model
    encoder_outputs = vit2gpt2.encoder(pixel_values=images)
    outputs = vit2gpt2(
        encoder_outputs=encoder_outputs,
        decoder_input_ids=tokenized_captions["input_ids"],
        decoder_attention_mask=tokenized_captions["attention_mask"],
        labels=labels,
        return_dict=True,
    )
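For reference, this is the kind of call I’d ideally like to be able to make (hypothetical, not working for me right now since generate seems to insist on input_ids; gpt2_tokenizer and MAX_LENGTH are placeholders from my setup):

    # Hypothetical: what I'd like generate() to accept, with no input_ids at all
    generated_ids = vit2gpt2.generate(
        encoder_outputs=encoder_outputs,
        decoder_start_token_id=gpt2_tokenizer.bos_token_id,
        max_length=MAX_LENGTH,
    )
    captions_out = gpt2_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)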

Here is a Kaggle kernel with a runnable version of the above snippet.

Update 1:

So I’ve managed to narrow my search down to generation_utils.py, but I cannot see where it loops over the predicted tokens and feeds them back into the model. I’m hoping to replicate that process myself; a rough sketch of what I’m aiming for is below.
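To make it concrete, here is a minimal, untested greedy-decoding sketch that reuses the encoder_outputs trick from above (again, gpt2_tokenizer and MAX_LENGTH are placeholders from my setup):

    import torch

    # Encode the image once, then repeatedly feed the growing decoder sequence back in,
    # appending the argmax token each step until EOS or a length limit is hit.
    with torch.no_grad():
        encoder_outputs = vit2gpt2.encoder(pixel_values=images[:1])  # a single image
        decoder_input_ids = torch.tensor([[gpt2_tokenizer.bos_token_id]])

        for _ in range(MAX_LENGTH):
            outputs = vit2gpt2(
                encoder_outputs=encoder_outputs,
                decoder_input_ids=decoder_input_ids,
                return_dict=True,
            )
            # Take the most likely next token and append it to the decoder input
            next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
            if next_token.item() == gpt2_tokenizer.eos_token_id:
                break

    caption = gpt2_tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)

If generate can already consume encoder_outputs directly, I’d much rather use that, since it would also give me things like beam search for free.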