What if a sequence of ViT outputs is fed into GPT?

I’m going to build a model like the picture below:

where a vision transformer (ViT) model's outputs for different pictures are fed into a GPT model.
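In code, the setup I have in mind is roughly like the sketch below. Everything here is a placeholder: `vit_encode`, `gpt_decode`, and all the dimensions are stand-ins I made up to illustrate the wiring, not real model APIs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes -- not real model dimensions.
n_images, img_dim, embed_dim, vocab = 10, 32, 16, 100

# Placeholder weights standing in for the two pretrained models.
W_vit = rng.standard_normal((embed_dim, img_dim)) * 0.1
W_gpt = rng.standard_normal((vocab, embed_dim)) * 0.1

def vit_encode(image):
    """Stand-in for the ViT: map one image to one embedding vector."""
    return W_vit @ image

def gpt_decode(embeddings):
    """Stand-in for GPT: map a sequence of embeddings to per-step logits."""
    return embeddings @ W_gpt.T

images = rng.standard_normal((n_images, img_dim))

# The SAME ViT encodes every image; its outputs form the GPT's input sequence.
seq = np.stack([vit_encode(img) for img in images])  # shape (10, embed_dim)
logits = gpt_decode(seq)                             # shape (10, vocab)
```

The point of the sketch is just that one shared ViT runs once per image, and the 10 resulting embeddings are stacked into the sequence the GPT consumes.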

I’m worried about the GPU memory needed for the gradients, and also about the training time for this model.

Let’s say 10 outputs of the ViT are fed into the GPT. Does this mean that training requires 10 × (the size of the ViT parameter gradients) of GPU memory? Or are the gradients accumulated sequentially on the fly, so that it might need only about 2 × (the size of the ViT parameter gradients)?
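To make my question concrete, here is a tiny hand-worked example of what I mean by "accumulated on the fly", using a one-layer linear map as a stand-in for the shared ViT (the names and dimensions are my own, not from any library). The per-image gradient contributions are summed into a single buffer the size of the weights, rather than kept as 10 separate copies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in for the shared ViT: y = W @ x for each image x.
d_in, d_out, n_images = 8, 4, 10
W = rng.standard_normal((d_out, d_in))
images = rng.standard_normal((n_images, d_in))

# Forward: the SAME W processes every image (weight sharing).
outputs = images @ W.T        # shape (10, d_out)
loss = outputs.sum()          # toy loss: sum of all output elements

# Backward by hand: dL/dy = 1 for every output element, and the
# gradient w.r.t. the shared W is the SUM of per-image contributions,
# accumulated into ONE buffer with the same shape as W.
grad_W = np.zeros_like(W)
dL_dy = np.ones(d_out)
for x in images:
    grad_W += np.outer(dL_dy, x)  # accumulate on the fly, one image at a time
```

So in this toy version, `grad_W` stays the size of `W` no matter how many images pass through; my question is whether real autograd frameworks behave the same way for the shared ViT here (I understand activations are a separate cost that does grow with the number of images).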

I also guess that training this model would take a significant amount of time and be very slow,
but I’m not sure whether I’m right.

Any guesses would be appreciated.