Multimodal fusion options - thoughts?

Hi all,
I’m building a multimodal binary classification model for healthcare and I’m stuck. The modalities are tabular, text, and two imaging modalities. I’ll use modality-specific techniques for feature embedding, e.g. a CNN for images, a transformer for text, and GBDT for tabular data.
Now, here’s the tricky part: for each subject there could be missing data, i.e. a subject might have text, tabular, and one image but not the other.
Can anyone suggest the best way to fuse the embeddings while taking the missing data into account? My thoughts so far are cross-attention, Tensor Fusion Network (TFN), and low-rank multimodal fusion, but I’m not sure how any of them would handle the missing modalities.
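To make the missing-data setup concrete, here is a minimal sketch of the simplest baseline I can think of, which is not one of the methods listed above: average whichever embeddings are present and append a 0/1 presence flag per modality, so the classifier can see what was missing. The modality names, the function name `masked_mean_fusion`, and the assumption that every embedding has already been projected to the same dimension are all hypothetical, purely for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical modality names; the two imaging modalities are placeholders.
MODALITIES = ["tabular", "text", "image_a", "image_b"]


def masked_mean_fusion(
    embeddings: Dict[str, Optional[List[float]]], dim: int
) -> List[float]:
    """Average the embeddings that are present, then append presence flags.

    Assumes each available embedding was already projected to length `dim`.
    A modality is treated as missing if its key is absent or its value is None.
    """
    present = [embeddings[m] for m in MODALITIES if embeddings.get(m) is not None]
    for vec in present:
        if len(vec) != dim:
            raise ValueError(f"expected embedding of length {dim}, got {len(vec)}")

    if present:
        # Element-wise mean over the available modalities only.
        fused = [sum(vals) / len(present) for vals in zip(*present)]
    else:
        # No modality available: fall back to a zero vector.
        fused = [0.0] * dim

    # One flag per modality, in the fixed MODALITIES order.
    flags = [1.0 if embeddings.get(m) is not None else 0.0 for m in MODALITIES]
    return fused + flags


# A subject with text, tabular, and one image, but not the other image.
subject = {
    "tabular": [1.0, 2.0],
    "text": [3.0, 4.0],
    "image_a": [5.0, 6.0],
    "image_b": None,
}
print(masked_mean_fusion(subject, dim=2))
```

The fused vector always has length `dim + 4` regardless of which modalities are missing, so it can feed a single downstream classifier. What I’m unsure about is how to get the same robustness out of the richer fusion methods.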
Thanks
