Hello! I am a student currently working on a multimodal transformer model with unimodal encoders feeding into a bottleneck fusion layer. This is my first time working with multimodality, and a couple of things confused me while reading up on the background:
- How is the full model trained? Since each modality has its own token encoder, how can I ensure that each of these unimodal encoders actually learns a useful representation?
- Is there a more principled way to choose the dimensionality of the tokens fed into the cross-modal fusion layer, rather than just trial and error? My modalities differ in complexity, and I would like to find a sweet spot: too high a dimension risks overfitting, too low risks underfitting.
Here is an extremely broad sketch of my generic pipeline:
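(Rough PyTorch-style sketch only; the two modalities, their input dimensions, the number of bottleneck tokens, and the classification head are placeholders to show the structure, not my actual settings.)

```python
import torch
import torch.nn as nn


class UnimodalEncoder(nn.Module):
    """Per-modality encoder: project modality-specific features to a shared
    width d_model, then self-attend over the tokens.
    (Positional embeddings omitted for brevity.)"""
    def __init__(self, input_dim, d_model, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                      # x: (batch, seq_len, input_dim)
        return self.encoder(self.proj(x))      # (batch, seq_len, d_model)


class BottleneckFusion(nn.Module):
    """Fuse two token streams through a small set of learned bottleneck tokens.
    An attention mask keeps the two modalities from attending to each other
    directly, so cross-modal information must pass through the bottleneck."""
    def __init__(self, d_model, n_bottleneck=4, n_heads=4, n_layers=2):
        super().__init__()
        self.n_bottleneck = n_bottleneck
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tok_a, tok_b):           # (batch, len_a, d), (batch, len_b, d)
        bsz, len_a, len_b = tok_a.size(0), tok_a.size(1), tok_b.size(1)
        btl = self.bottleneck.expand(bsz, -1, -1)
        seq = torch.cat([tok_a, btl, tok_b], dim=1)

        # True = "not allowed to attend": modality-A queries cannot see
        # modality-B keys and vice versa; bottleneck tokens see everything.
        total = len_a + self.n_bottleneck + len_b
        mask = torch.zeros(total, total, dtype=torch.bool, device=seq.device)
        mask[:len_a, len_a + self.n_bottleneck:] = True
        mask[len_a + self.n_bottleneck:, :len_a] = True

        fused = self.fusion(seq, mask=mask)
        # Pool the bottleneck slice as the fused representation.
        return fused[:, len_a:len_a + self.n_bottleneck].mean(dim=1)


class MultimodalModel(nn.Module):
    def __init__(self, dim_a, dim_b, d_model=128, n_classes=10):
        super().__init__()
        self.enc_a = UnimodalEncoder(dim_a, d_model)
        self.enc_b = UnimodalEncoder(dim_b, d_model)
        self.fuse = BottleneckFusion(d_model)
        self.head = nn.Linear(d_model, n_classes)   # placeholder task head

    def forward(self, x_a, x_b):
        return self.head(self.fuse(self.enc_a(x_a), self.enc_b(x_b)))


# A single task loss at the end is backpropagated through the fusion layer
# into both unimodal encoders, so everything trains end to end.
model = MultimodalModel(dim_a=64, dim_b=256)
logits = model(torch.randn(2, 10, 64), torch.randn(2, 20, 256))
print(logits.shape)   # torch.Size([2, 10])
```

The attention mask is what makes the fusion a "bottleneck" here: tokens from one modality can only see the other modality through the shared bottleneck tokens.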
Thanks for the help!!