How to combine images and text in SageMaker

Hi there AWS + HuggingFace heroes,

I am working on an exciting model inside SageMaker that should be fine-tuned on a multi-class classification task with both images and text as input. Hence, multimodal.

I cannot find an example that showcases how to deal with multimodal models in SageMaker (if there is one - please enlighten me).

How I plan to work around this is by first following a text multi-class classification example - see below:

And then follow a vision classification example - see below:

And finally, try to figure out how I can combine the text tokens and visual embeddings into a multimodal model, e.g. VisualBERT - roughly along the lines of the sketch below.
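To make that last step concrete, this is the kind of combination I have in mind - just an untested sketch using transformers' VisualBertModel, where the checkpoint name is only an example and the random visual features stand in for real detector output:

```python
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Text side: ordinary BERT tokens.
inputs = tokenizer("a photo of two cats on a couch", return_tensors="pt")

# Vision side: region features (normally from an object detector such as Faster R-CNN).
# Random tensors are used here only as a stand-in for real detector output.
visual_dim = model.config.visual_embedding_dim
visual_embeds = torch.rand(1, 36, visual_dim)  # (batch, num_regions, feature_dim)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)

# VisualBERT appends the visual embeddings to the token embeddings and
# runs everything through one joint transformer.
outputs = model(
    **inputs,
    visual_embeds=visual_embeds,
    visual_attention_mask=visual_attention_mask,
    visual_token_type_ids=visual_token_type_ids,
)
pooled = outputs.pooler_output  # could feed this into a multi-class classification head
```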

Would you approach this differently? Would anyone have any experience in how to combine text and images as a multimodal input in SageMaker? Please share if you have any tips, thanks.

Hey @Petrus,

sounds like a cool project! Fine-tuning a model on multimodal data (vision and text) should make a huge difference compared to text alone. You can pass your data from Amazon S3 to the training job through the .fit() method when starting your training. SageMaker will load the data into /opt/ml/input/data/{train,test}/, from where you can access it during training; these folders can contain images as well as text.

If you need additional dependencies, you can provide a requirements.txt in your source_dir; SageMaker will then install those dependencies before running your train.py. A rough sketch of the whole setup is below.
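Something like this could be a starting point (bucket names, framework versions, instance type and hyperparameters are just placeholders - adjust them to your project):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# source_dir contains train.py plus an optional requirements.txt,
# which SageMaker installs before the training script starts.
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters={"epochs": 3, "train_batch_size": 16},
)

# Each channel is mounted at /opt/ml/input/data/<channel_name> inside the container.
huggingface_estimator.fit({
    "train": "s3://your-bucket/multimodal/train",
    "test": "s3://your-bucket/multimodal/test",
})
```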

Hi @Petrus, nice question! I hope you had luck setting up your multimodal project. I have just been playing with the same problem and had a similar question to yours:

… any experience in how to combine text and images as a multimodal input in SageMaker?

I’ll summarize what I did - might help others in doing the same:

Training multimodal models in SageMaker

This is the easier part, since training data is most probably in your controlled environment somewhere in S3. What you could do is prepare train/test data as CSV or JSON Lines containing textual or other features as well as paths to images somewhere on S3. To train on this data you'd need to download each image at runtime, so your data loaders and dataset class should have methods to do this. I'm using PyTorch, and it was just a matter of having a custom Dataset download each image in __getitem__(self, idx) - see the sketch below. How you combine features for training depends on your model, but one idea is to concatenate the text and image embeddings.
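Here is a minimal sketch of such a Dataset. The column names, S3 URI format and tokenizer are assumptions on my side - adapt them to your data:

```python
import io

import boto3
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms
from transformers import AutoTokenizer


class MultimodalDataset(Dataset):
    """Reads a CSV with 'text', 'image_s3_uri' and 'label' columns and
    downloads each image from S3 lazily in __getitem__."""

    def __init__(self, csv_path, tokenizer_name="bert-base-uncased", max_length=128):
        self.df = pd.read_csv(csv_path)
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_length = max_length
        self.s3 = boto3.client("s3")
        self.image_transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.df)

    def _download_image(self, s3_uri):
        # s3_uri looks like "s3://bucket/key/to/image.jpg"
        bucket, key = s3_uri.replace("s3://", "").split("/", 1)
        obj = self.s3.get_object(Bucket=bucket, Key=key)
        return Image.open(io.BytesIO(obj["Body"].read())).convert("RGB")

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        encoding = self.tokenizer(
            row["text"],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        image = self.image_transform(self._download_image(row["image_s3_uri"]))
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "pixel_values": image,
            "label": torch.tensor(row["label"], dtype=torch.long),
        }
```

From there a regular DataLoader works as usual, and the model's forward pass decides how the text and image features get combined (e.g. concatenating the two embeddings before the classification head).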

Serving multimodal models in SageMaker

Serving is a bit trickier, since you'd probably want a user-friendly interface to send images + text, and you need to set a proper content_type. I managed to do that by preparing multipart/form-data content and scoring against the model endpoint using the InvokeEndpoint API; alternatively, you could use the snippet below to write your own Serializer for the predictor.predict() interface. My preference is to have a Serializer implemented, since then the user can send just JSON with an image path and textual features, similar to the training data rows.

import urllib3
import boto3

# Encode the text feature and the image file into a single multipart/form-data body.
payload, content_type = urllib3.encode_multipart_formdata({
    "text": "some textual feature",
    "photo": ("image_name", open("image_path", "rb").read(), "image_mime_type")
}, boundary="random_string_for_multipart_content_boundary")

# Score against the deployed endpoint with the matching content type.
sm_runtime = boto3.client("sagemaker-runtime")

response = sm_runtime.invoke_endpoint(
    EndpointName="your_deployed_model_endpoint_name",
    ContentType=content_type,
    Accept="application/json",
    Body=payload
)
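If you prefer the Serializer route, something along these lines could work - an untested sketch where SimpleBaseSerializer comes from the SageMaker Python SDK v2 and the field names mirror the payload above:

```python
import urllib3
from sagemaker.serializers import SimpleBaseSerializer

BOUNDARY = "random_string_for_multipart_content_boundary"


class MultipartSerializer(SimpleBaseSerializer):
    """Turns {'text': ..., 'image_path': ...} into a multipart/form-data body."""

    def __init__(self):
        super().__init__(content_type=f'multipart/form-data; boundary="{BOUNDARY}"')

    def serialize(self, data):
        # Read the image from a local path supplied by the caller.
        with open(data["image_path"], "rb") as f:
            image_bytes = f.read()
        body, _ = urllib3.encode_multipart_formdata({
            "text": data["text"],
            "photo": (data["image_path"], image_bytes, "image/jpeg"),
        }, boundary=BOUNDARY)
        return body
```

You could then attach serializer=MultipartSerializer() to your Predictor (or pass it to .deploy()) and call predictor.predict({"text": "...", "image_path": "..."}).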

In your serving script you could expect the content type to be multipart/form-data; boundary="random_string_for_multipart_content_boundary"; textual features can be parsed from request.data and the image file content from request.files - a sketch of this is below. Hope this helps!
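And in case you use the default PyTorch/HuggingFace inference toolkit instead of your own Flask app, the same parsing can happen in input_fn. Again just an untested sketch - requests_toolbelt would need to be added to the inference requirements.txt:

```python
import io

from PIL import Image
from requests_toolbelt.multipart import decoder


def input_fn(request_body, content_type):
    """Split a multipart/form-data request into its text and image parts."""
    if not content_type.startswith("multipart/form-data"):
        raise ValueError(f"Unsupported content type: {content_type}")

    text, image = None, None
    for part in decoder.MultipartDecoder(request_body, content_type).parts:
        disposition = part.headers[b"Content-Disposition"].decode()
        if 'name="text"' in disposition:
            text = part.text
        elif 'name="photo"' in disposition:
            image = Image.open(io.BytesIO(part.content)).convert("RGB")

    return {"text": text, "image": image}
```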