Hello! I am a student currently working on a multimodal transformer model with unimodal encoders feeding into a bottleneck fusion layer. This is my first time working with multimodality, and a couple of things confused me while reading up on the background:
- How is the full model trained? Since each modality has its own token encoder, how can I ensure that each of these unimodal encoders actually learns a useful representation?
- Is there a more principled way to choose the dimensionality of the tokens fed into the cross-modal fusion layer, rather than just trial and error? My modalities differ in complexity, and I would like to find a sweet spot: too high a dimension risks overfitting, too low risks underfitting.
Here is an extremely broad sketch of my generic pipeline:
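(Rough PyTorch-style sketch only; the two modalities, their input dimensions, the number of bottleneck tokens, and the classification head are placeholders to show the structure, not my actual settings.)

```python
import torch
import torch.nn as nn


class UnimodalEncoder(nn.Module):
    """Per-modality encoder: project modality-specific features to a shared
    width d_model, then self-attend over the tokens.
    (Positional embeddings omitted for brevity.)"""
    def __init__(self, input_dim, d_model, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                      # x: (batch, seq_len, input_dim)
        return self.encoder(self.proj(x))      # (batch, seq_len, d_model)


class BottleneckFusion(nn.Module):
    """Fuse two token streams through a small set of learned bottleneck tokens.
    An attention mask keeps the two modalities from attending to each other
    directly, so cross-modal information must pass through the bottleneck."""
    def __init__(self, d_model, n_bottleneck=4, n_heads=4, n_layers=2):
        super().__init__()
        self.n_bottleneck = n_bottleneck
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tok_a, tok_b):           # (batch, len_a, d), (batch, len_b, d)
        bsz, len_a, len_b = tok_a.size(0), tok_a.size(1), tok_b.size(1)
        btl = self.bottleneck.expand(bsz, -1, -1)
        seq = torch.cat([tok_a, btl, tok_b], dim=1)

        # True = "not allowed to attend": modality-A queries cannot see
        # modality-B keys and vice versa; bottleneck tokens see everything.
        total = len_a + self.n_bottleneck + len_b
        mask = torch.zeros(total, total, dtype=torch.bool, device=seq.device)
        mask[:len_a, len_a + self.n_bottleneck:] = True
        mask[len_a + self.n_bottleneck:, :len_a] = True

        fused = self.fusion(seq, mask=mask)
        # Pool the bottleneck slice as the fused representation.
        return fused[:, len_a:len_a + self.n_bottleneck].mean(dim=1)


class MultimodalModel(nn.Module):
    def __init__(self, dim_a, dim_b, d_model=128, n_classes=10):
        super().__init__()
        self.enc_a = UnimodalEncoder(dim_a, d_model)
        self.enc_b = UnimodalEncoder(dim_b, d_model)
        self.fuse = BottleneckFusion(d_model)
        self.head = nn.Linear(d_model, n_classes)   # placeholder task head

    def forward(self, x_a, x_b):
        return self.head(self.fuse(self.enc_a(x_a), self.enc_b(x_b)))


# A single task loss at the end is backpropagated through the fusion layer
# into both unimodal encoders, so everything trains end to end.
model = MultimodalModel(dim_a=64, dim_b=256)
logits = model(torch.randn(2, 10, 64), torch.randn(2, 20, 256))
print(logits.shape)   # torch.Size([2, 10])
```

The attention mask is what makes the fusion a "bottleneck" here: tokens from one modality can only see the other modality through the shared bottleneck tokens.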
Thanks for the help!!