Multimodal transformer

Hi! I’m currently working with social media data that has 4 modalities: image, text (sentences plus context-free text like hashtags), categories, and time-series data (posting data per post, along with the username of who posted it). I explored the Hugging Face models for multimodal transformers and found that they all use only 2 modalities (text-text, image-text, or speech-text), or are graph transformers.

Can I take an image-text multimodal transformer and fine-tune it on my dataset with 4 modalities, where the input is a post’s information grouped by user? Any tips on whether that would work well, and how to go about it?
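In case it helps frame the question, here is a minimal sketch of the kind of late-fusion setup I have in mind: each modality gets its own projection into a shared dimension, each becomes one token, and a small transformer fuses them. Everything here is made up for illustration (the `LateFusionModel` class, all dimensions, the classification head); in practice the image and text embeddings would come from pretrained encoders such as CLIP or BERT rather than random tensors.

```python
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    """Hypothetical sketch: project each modality's embedding to a shared
    width, treat each one as a token, and fuse them with a small
    transformer encoder. Not a real Hugging Face model."""

    def __init__(self, img_dim=512, txt_dim=768, n_categories=20,
                 time_dim=8, d_model=256, n_classes=2):
        super().__init__()
        # Per-modality projections into the shared d_model space
        self.img_proj = nn.Linear(img_dim, d_model)   # e.g. CLIP image embedding
        self.txt_proj = nn.Linear(txt_dim, d_model)   # e.g. BERT [CLS] embedding
        self.cat_emb = nn.Embedding(n_categories, d_model)  # category id
        self.time_proj = nn.Linear(time_dim, d_model)       # time-series features
        # Learned [CLS]-style token used for pooling after fusion
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, img_emb, txt_emb, cat_id, time_feats):
        # One token per modality -> (batch, 4, d_model)
        tokens = torch.stack([
            self.img_proj(img_emb),
            self.txt_proj(txt_emb),
            self.cat_emb(cat_id),
            self.time_proj(time_feats),
        ], dim=1)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        fused = self.fusion(torch.cat([cls, tokens], dim=1))
        return self.head(fused[:, 0])  # predict from the fused CLS token

# Dummy batch of 3 posts, just to check the shapes flow through
model = LateFusionModel()
out = model(torch.randn(3, 512), torch.randn(3, 768),
            torch.randint(0, 20, (3,)), torch.randn(3, 8))
print(out.shape)  # torch.Size([3, 2])
```

The idea would be to keep the pretrained image and text encoders frozen (or lightly fine-tuned) and only train the projections and the fusion layers, and to extend the same pattern to a sequence of posts per user by adding one token per post instead of one token per modality.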