Dimensionality matching in multimodal transformer model

Have you tried adding an auxiliary loss on each encoder's output before fusion?
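To make the suggestion concrete, here is a rough PyTorch sketch of what that could look like. Everything in it is an assumption for illustration: the class name `TwoEncoderFusion`, the single-layer stand-in encoders, the feature widths (32 and 48), the shared width `d_model=16`, concatenation as the fusion step, and the `aux_weight=0.3` factor. None of it comes from the original question. The idea is just that each modality is projected to the same width, gets its own small classifier before fusion, and those per-modality losses are added to the main loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoEncoderFusion(nn.Module):
    """Hypothetical two-modality model with per-encoder auxiliary heads."""

    def __init__(self, dim_a=32, dim_b=48, d_model=16, num_classes=3):
        super().__init__()
        # Stand-ins for real modality encoders (assumed, not from the thread).
        self.enc_a = nn.Linear(dim_a, dim_a)
        self.enc_b = nn.Linear(dim_b, dim_b)
        # Dimensionality matching: project both modalities to d_model.
        self.proj_a = nn.Linear(dim_a, d_model)
        self.proj_b = nn.Linear(dim_b, d_model)
        # Auxiliary heads see only their own modality, before fusion.
        self.aux_a = nn.Linear(d_model, num_classes)
        self.aux_b = nn.Linear(d_model, num_classes)
        # Fusion by concatenation, followed by the main classifier.
        self.head = nn.Linear(2 * d_model, num_classes)

    def forward(self, x_a, x_b):
        h_a = self.proj_a(torch.relu(self.enc_a(x_a)))
        h_b = self.proj_b(torch.relu(self.enc_b(x_b)))
        fused = torch.cat([h_a, h_b], dim=-1)
        return self.head(fused), self.aux_a(h_a), self.aux_b(h_b)


def total_loss(main_logits, aux_a_logits, aux_b_logits, target, aux_weight=0.3):
    # Main loss on the fused prediction plus a weighted loss per encoder.
    main = F.cross_entropy(main_logits, target)
    aux = F.cross_entropy(aux_a_logits, target) + F.cross_entropy(aux_b_logits, target)
    return main + aux_weight * aux


torch.manual_seed(0)
model = TwoEncoderFusion()
x_a, x_b = torch.randn(4, 32), torch.randn(4, 48)  # batch of 4, random features
target = torch.tensor([0, 1, 2, 1])
main_logits, aux_a_logits, aux_b_logits = model(x_a, x_b)
loss = total_loss(main_logits, aux_a_logits, aux_b_logits, target)
loss.backward()  # gradients reach each encoder via both the main and aux paths
```

The auxiliary heads are only needed during training and can be dropped at inference. The weight on the auxiliary term is a hyperparameter you would have to tune for your data.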
