Multimodal datasets and corresponding models

Is there a model on Hugging Face that can process multimodal data like “CMU-MOSI”? I’m just getting started, so any advice would be appreciated.


If you search the Hub with keywords such as VQA (visual question answering) or VL (vision-language), you should find many models that can handle images and text. There are still very few models that can handle audio, but the following are some well-known, recent examples.
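For the keyword search mentioned above, here is a rough sketch of how you could build the search links yourself. `hub_search_url` is a hypothetical helper I made up for illustration, not part of any library, and I'm assuming the Hub's model page accepts the `search` and `pipeline_tag` query parameters the way it does when you use the search box and task filter in the browser.

```python
from urllib.parse import urlencode

# Base URL of the Hub's model listing page.
HUB_MODELS_URL = "https://huggingface.co/models"


def hub_search_url(keyword=None, pipeline_tag=None):
    """Build a model-search URL for the Hub (hypothetical helper).

    keyword      -- free-text search term, e.g. "VQA" or "VL"
    pipeline_tag -- task filter, e.g. "visual-question-answering"
                    (assumed to match the task names shown on the Hub)
    """
    params = {}
    if pipeline_tag:
        params["pipeline_tag"] = pipeline_tag
    if keyword:
        params["search"] = keyword
    query = urlencode(params)
    return f"{HUB_MODELS_URL}?{query}" if query else HUB_MODELS_URL


# Print one link per keyword from the reply, plus a task-filtered link.
for kw in ("VQA", "VL"):
    print(hub_search_url(keyword=kw))
print(hub_search_url(pipeline_tag="visual-question-answering"))
```

Opening the printed links in a browser should show the same results as typing the keyword into the Hub's search box.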

Thanks, I appreciate it.
