Multimodal datasets and corresponding models

wRise · March 12, 2025, 10:06am

Is there a model on Hugging Face that can process multimodal data like “CMU-MOSI”? I’m just getting started, so any advice would be appreciated.

John6666 · March 12, 2025, 10:49am

If you search for models that can handle images and text using keywords such as VQA or VL, you should be able to find many. There are still very few models that can handle audio, but the following are some well-known, recent examples.
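
The keyword search described above can also be run from code instead of the website. This is only a rough sketch, assuming the `huggingface_hub` package is installed and using its `HfApi.list_models` call with the `search` and `limit` arguments. The helper name `find_models` and its `list_models` parameter are made up for illustration and are not part of any library.

```python
from typing import Callable, Iterable, List, Optional


def find_models(
    keyword: str,
    limit: int = 5,
    list_models: Optional[Callable[..., Iterable]] = None,
) -> List[str]:
    """Return up to `limit` model ids whose Hub entry matches `keyword`.

    `list_models` is a hypothetical hook so the lookup can be swapped out;
    by default it falls back to HfApi().list_models from huggingface_hub.
    """
    if list_models is None:
        # Imported lazily, so the package is only needed for real Hub queries.
        from huggingface_hub import HfApi

        list_models = HfApi().list_models
    # Each result is expected to expose the repository name as `.id`.
    return [model.id for model in list_models(search=keyword, limit=limit)]
```

Calling `find_models("VQA")` or `find_models("VL")` would query the Hub over the network, so the results depend on what is published at the time.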

wRise · March 12, 2025, 11:41am

Thanks, much appreciated.
