What if a sequence of ViT outputs is fed into GPT?

anon3699016 · August 4, 2022, 2:01pm

Hi!
I’m going to build a model in which a vision transformer (ViT) encodes several different pictures and the resulting sequence of outputs is fed into a GPT model.

I’m worried about the GPU memory needed for the gradient updates, and also about the training time for this model.

Let’s say that 10 outputs of the ViT are fed into the GPT.
Does this mean that training requires 10 x (gradient size of the ViT parameters) of GPU memory? Or are the gradients aggregated sequentially on the fly, so that it may require only 2 x (gradient size of the ViT parameters)?
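
To make the question concrete, here is a minimal sketch of how I currently picture it. It is plain Python, not real ViT or GPT code. My understanding, which may be incomplete, is that autograd frameworks such as PyTorch sum the gradient contributions of a parameter that is reused several times into a single buffer of the parameter's size, while the intermediate values saved for the backward pass are kept once per input. The toy "encoder", the squared-sum loss standing in for the GPT part, and all names (`encode`, `run_step`) are hypothetical and chosen only for illustration.

```python
# Toy stand-in for a shared encoder applied to N images.
# Everything here is hypothetical and only illustrates the bookkeeping.

def encode(weights, image):
    # Shared "encoder": dot product of the shared weights with one image's features.
    return sum(w * x for w, x in zip(weights, image))

def run_step(weights, images):
    # Forward pass: the same weights are reused for every image.
    # Inputs are saved for the backward pass, so this storage grows with N.
    saved_inputs = []
    outputs = []
    for image in images:
        saved_inputs.append(image)
        outputs.append(encode(weights, image))

    # Stand-in for the GPT part plus loss: sum of squared encoder outputs.
    loss = sum(h * h for h in outputs)

    # Backward pass: ONE gradient buffer, the same size as the weights.
    grad = [0.0] * len(weights)
    for image, h in zip(saved_inputs, outputs):
        dloss_dh = 2.0 * h
        for j, x in enumerate(image):
            grad[j] += dloss_dh * x  # contributions are accumulated in place
    return loss, grad, len(saved_inputs)

weights = [0.5, -1.0, 2.0]
images = [[float(i), 1.0, 0.5] for i in range(10)]
loss, grad, n_saved = run_step(weights, images)
print(len(grad), n_saved)  # prints: 3 10
```

If this picture carries over to the real model, the gradient storage for the ViT parameters would stay at 1 x their size no matter how many pictures are encoded, and the part that scales with the 10 inputs would be the saved activations.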

I also guess that training this model would take a significant amount of time and be very slow,
but I’m not sure whether I’m right.

Any thoughts would be appreciated.
Thanks!
