Dimensionality matching in multimodal transformer model

Hello! I am a student currently working on a multimodal transformer model with unimodal encoders feeding into a bottleneck fusion layer. This is my first time working with multimodality, and two points confused me while reading up on the background:

  1. How is the full model trained? Since each modality has its own token encoder, how can I make sure that each of these unimodal encoders learns a correct representation?
  2. Is there a more concrete way to choose the dimensionality of the tokens fed into the cross-modal fusion layer, rather than just trial and error? My modalities differ in complexity, and I would like to find a sweet spot that avoids both overfitting (dimensionality too high) and underfitting (dimensionality too low).

Here is an extremely broad sketch of my generic pipeline:
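In case the diagram does not come through, here is roughly the same thing as a minimal PyTorch-style sketch with two placeholder modalities; all names, dimensions, and layer counts are illustrative stand-ins, not my real values:

```python
import torch
import torch.nn as nn

class BottleneckFusionModel(nn.Module):
    """Rough sketch: two unimodal encoders, per-modality projections to a
    shared width, and a fusion block with learnable bottleneck tokens."""

    def __init__(self, dim_a=256, dim_b=64, fusion_dim=128, n_bottleneck=4, n_classes=10):
        super().__init__()
        # Placeholder unimodal encoders (stand-ins for the real ones).
        self.encoder_a = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim_a, nhead=4, batch_first=True), num_layers=2)
        self.encoder_b = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim_b, nhead=4, batch_first=True), num_layers=2)
        # Per-modality projections reconcile the different token widths
        # into one shared fusion dimension.
        self.proj_a = nn.Linear(dim_a, fusion_dim)
        self.proj_b = nn.Linear(dim_b, fusion_dim)
        # Learnable bottleneck tokens that mediate the cross-modal exchange.
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, fusion_dim))
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=fusion_dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(fusion_dim, n_classes)  # arbitrary downstream task

    def forward(self, tokens_a, tokens_b):
        # tokens_a: (B, La, dim_a), tokens_b: (B, Lb, dim_b)
        za = self.proj_a(self.encoder_a(tokens_a))    # (B, La, fusion_dim)
        zb = self.proj_b(self.encoder_b(tokens_b))    # (B, Lb, fusion_dim)
        btl = self.bottleneck.expand(za.size(0), -1, -1)
        fused = self.fusion(torch.cat([za, btl, zb], dim=1))
        # Pool only the bottleneck positions for the final prediction.
        pooled = fused[:, za.size(1):za.size(1) + btl.size(1)].mean(dim=1)
        return self.head(pooled)
```

In this sketch, `proj_a` / `proj_b` are where the different encoder widths get mapped into one shared `fusion_dim` (my question 2), and everything is trained end to end from a single task loss with gradients flowing back into both encoders (my question 1). Note that this simplified version lets all tokens attend to each other inside the fusion block; a strict bottleneck would use attention masks so the two modalities can only exchange information through the bottleneck tokens.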

Thanks for the help!!


First and foremost: welcome, @ch106, and thank you for sharing. I am an older person, so every post seems like Christmas to me.

If you like, I would read about what a "multimodal transformer" is. My background is in binary languages. I am learning.


Have you considered using an auxiliary loss on each encoder before fusion? That gives each unimodal encoder a direct training signal instead of relying only on gradients flowing back through the fusion layer.
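Something like this, as a rough sketch rather than working training code; it assumes a model shaped like the one you sketched above (with `encoder_a`/`encoder_b` and `proj_a`/`proj_b` exposed) plus hypothetical auxiliary heads:

```python
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical auxiliary heads (their parameters also need to go into the optimizer).
aux_head_a = nn.Linear(128, 10)   # fusion_dim -> n_classes, matching the sketch above
aux_head_b = nn.Linear(128, 10)

def training_step(model, tokens_a, tokens_b, labels, aux_weight=0.3):
    """Main fusion loss plus a per-encoder auxiliary loss."""
    # Recomputing the projected unimodal tokens here is only for illustration;
    # in practice you would return them from the model's forward pass.
    za = model.proj_a(model.encoder_a(tokens_a)).mean(dim=1)   # (B, fusion_dim)
    zb = model.proj_b(model.encoder_b(tokens_b)).mean(dim=1)
    main_logits = model(tokens_a, tokens_b)

    loss_main = F.cross_entropy(main_logits, labels)
    loss_a = F.cross_entropy(aux_head_a(za), labels)   # direct signal for encoder A
    loss_b = F.cross_entropy(aux_head_b(zb), labels)   # direct signal for encoder B
    return loss_main + aux_weight * (loss_a + loss_b)
```

`aux_weight` is just a tunable knob; the auxiliary heads can be dropped at inference time.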