How to combine images and text in SageMaker

Hi there AWS + HuggingFace heroes,

I am working on an exciting model in SageMaker that I want to fine-tune on a multi-class classification task whose input is both images and text. Hence, multimodal.

I cannot find an example that showcases how to deal with multimodal models in SageMaker (if there is one - please enlighten me).

How I plan to work around this is by first following a text multi-class classification example - see below:

Then follow a vision classification example - see below:

And finally, figure out how to combine the text tokens and visual embeddings in a multimodal model, e.g. VisualBERT.
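
Concretely, I imagine wiring the two together roughly like this. A minimal sketch: the dummy region features, the detector that would produce them, and the classification head are all placeholders I would still have to build (VisualBERT was pretrained on 2048-dim Faster R-CNN region features):

```python
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Placeholder visual features: in practice these would come from an
# object detector (e.g. Faster R-CNN), shape (batch, num_regions, 2048).
visual_embeds = torch.randn(1, 36, 2048)
visual_attention_mask = torch.ones(1, 36, dtype=torch.long)
visual_token_type_ids = torch.ones(1, 36, dtype=torch.long)

inputs = tokenizer("a dog playing in the park", return_tensors="pt")
outputs = model(
    **inputs,
    visual_embeds=visual_embeds,
    visual_attention_mask=visual_attention_mask,
    visual_token_type_ids=visual_token_type_ids,
)

# Feed the pooled, fused representation into my own multi-class head.
num_classes = 5  # hypothetical number of classes
classifier = torch.nn.Linear(model.config.hidden_size, num_classes)
logits = classifier(outputs.pooler_output)
```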

Would you approach this differently? Does anyone have experience combining text and images as multimodal input in SageMaker? Please share any tips, thanks.

Hey @Petrus,

Sounds like a cool project! Fine-tuning a model on multimodal data (vision and text) should make a big difference compared to text alone. You can pass your data from Amazon S3 to the training job through the .fit() method when starting your training. SageMaker will load the data into /opt/ml/input/data/{train,test}/, from where you can access it during training; these folders can contain images as well as text.
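
For example, a minimal sketch (the role ARN, instance type, versions, and S3 URIs are placeholders you'd replace with your own):

```python
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

# Each dict key becomes a channel; SageMaker downloads the S3 prefix
# (images, text files, labels, ...) to /opt/ml/input/data/<channel>/.
huggingface_estimator.fit({
    "train": "s3://my-bucket/multimodal/train",
    "test": "s3://my-bucket/multimodal/test",
})
```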

If you need additional dependencies, you can provide a requirements.txt in your source_dir; SageMaker will then install those dependencies before running your train.py.
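
A hypothetical source_dir layout (the pinned packages are just examples of extra dependencies your train.py might need beyond what the container ships with):

```
scripts/
├── train.py
└── requirements.txt   # e.g. timm==0.9.2, scikit-learn==1.3.0
```

And inside train.py you can locate the data channels via the environment variables SageMaker sets:

```python
import os

# SageMaker exposes each input channel as SM_CHANNEL_<NAME>.
train_dir = os.environ["SM_CHANNEL_TRAIN"]  # -> /opt/ml/input/data/train
test_dir = os.environ["SM_CHANNEL_TEST"]    # -> /opt/ml/input/data/test
```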