How to combine images and text in SageMaker

Hi there AWS + HuggingFace heroes,

I am working on an exciting model inside SageMaker that should be fine-tuned on a multi-class classification task with both images and text as input. Hence, multimodal.

I cannot find an example that showcases how to deal with multimodal models in SageMaker (if there is one - please enlighten me).

How I plan to work around this is by first following a text multi-class classification example - see below:

And then follow a vision classification example - see below:

And finally, try to figure out how I can combine the text tokens and visual embeddings into a multimodal model, e.g. VisualBERT - roughly along the lines of the sketch below.
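To make that last step concrete, this is the kind of combination I have in mind - just an untested sketch using transformers' VisualBertModel, where the checkpoint name is only an example and the random visual features stand in for real detector output:

```python
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Text side: ordinary BERT tokens.
inputs = tokenizer("a photo of two cats on a couch", return_tensors="pt")

# Vision side: region features (normally from an object detector such as Faster R-CNN).
# Random tensors are used here only as a stand-in for real detector output.
visual_dim = model.config.visual_embedding_dim
visual_embeds = torch.rand(1, 36, visual_dim)  # (batch, num_regions, feature_dim)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)

# VisualBERT appends the visual embeddings to the token embeddings and
# runs everything through one joint transformer.
outputs = model(
    **inputs,
    visual_embeds=visual_embeds,
    visual_attention_mask=visual_attention_mask,
    visual_token_type_ids=visual_token_type_ids,
)
pooled = outputs.pooler_output  # could feed this into a multi-class classification head
```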

Would you approach this differently? Would anyone have any experience in how to combine text and images as a multimodal input in SageMaker? Please share if you have any tips, thanks.

Hey @Petrus,

sounds like a cool project! Fine-tuning a model on multimodal data (vision and text) should make a huge difference compared to text alone. You can pass your data from Amazon S3 to the training job through the .fit() method when starting your training. SageMaker will load the data into /opt/ml/input/data/{train,test}/, from where you can access it during training; these folders can contain images as well as text.

If you need additional dependencies, you can provide a requirements.txt in your source_dir; SageMaker will then install those dependencies before running your train.py. A rough sketch of the whole setup is below.
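Something like this could be a starting point (bucket names, framework versions, instance type and hyperparameters are just placeholders - adjust them to your project):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# source_dir contains train.py plus an optional requirements.txt,
# which SageMaker installs before the training script starts.
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters={"epochs": 3, "train_batch_size": 16},
)

# Each channel is mounted at /opt/ml/input/data/<channel_name> inside the container.
huggingface_estimator.fit({
    "train": "s3://your-bucket/multimodal/train",
    "test": "s3://your-bucket/multimodal/test",
})
```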

Hi @Petrus, nice question! I hope you had luck setting up your multimodal project. I have just been playing with the same problem and had a similar question to yours:

… any experience in how to combine text and images as a multimodal input in SageMaker?

I’ll summarize what I did - might help others in doing the same:

Training multimodal models in SageMaker

This is the easier part, since training data is most probably in your controlled environment somewhere in S3. What you could do is prepare train/test data as CSV or JSON Lines containing textual or other features as well as paths to images somewhere on S3. To train on this data you'd need to download each image at runtime, so your data loaders and dataset class should have methods to do this. I'm using PyTorch, and it was just a matter of having a custom Dataset download each image in __getitem__(self, idx) - see the sketch below. How you combine features for training depends on your model, but one idea is to concatenate the text and image embeddings.
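Here is a minimal sketch of such a Dataset. The column names, S3 URI format and tokenizer are assumptions on my side - adapt them to your data:

```python
import io

import boto3
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms
from transformers import AutoTokenizer


class MultimodalDataset(Dataset):
    """Reads a CSV with 'text', 'image_s3_uri' and 'label' columns and
    downloads each image from S3 lazily in __getitem__."""

    def __init__(self, csv_path, tokenizer_name="bert-base-uncased", max_length=128):
        self.df = pd.read_csv(csv_path)
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_length = max_length
        self.s3 = boto3.client("s3")
        self.image_transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.df)

    def _download_image(self, s3_uri):
        # s3_uri looks like "s3://bucket/key/to/image.jpg"
        bucket, key = s3_uri.replace("s3://", "").split("/", 1)
        obj = self.s3.get_object(Bucket=bucket, Key=key)
        return Image.open(io.BytesIO(obj["Body"].read())).convert("RGB")

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        encoding = self.tokenizer(
            row["text"],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        image = self.image_transform(self._download_image(row["image_s3_uri"]))
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "pixel_values": image,
            "label": torch.tensor(row["label"], dtype=torch.long),
        }
```

From there a regular DataLoader works as usual, and the model's forward pass decides how the text and image features get combined (e.g. concatenating the two embeddings before the classification head).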

Serving multimodal models in SageMaker

Serving is a bit trickier, since you'd probably want a user-friendly interface to send images + text, and you need to set a proper content_type. I managed to do that by preparing multipart/form-data content and scoring against the model endpoint using the InvokeEndpoint API; alternatively, you could use the snippet below to write your own Serializer for the predictor.predict() interface. My preference is to have a Serializer implemented, since then the user can send just JSON with an image path and textual features, similar to the training data rows.

import urllib3
import boto3

# Encode the text feature and the image file into a single multipart/form-data body.
payload, content_type = urllib3.encode_multipart_formdata({
    "text": "some textual feature",
    "photo": ("image_name", open("image_path", "rb").read(), "image_mime_type")
}, boundary="random_string_for_multipart_content_boundary")

# Score against the deployed endpoint with the matching content type.
sm_runtime = boto3.client("sagemaker-runtime")

response = sm_runtime.invoke_endpoint(
    EndpointName="your_deployed_model_endpoint_name",
    ContentType=content_type,
    Accept="application/json",
    Body=payload
)
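If you prefer the Serializer route, something along these lines could work - an untested sketch where SimpleBaseSerializer comes from the SageMaker Python SDK v2 and the field names mirror the payload above:

```python
import urllib3
from sagemaker.serializers import SimpleBaseSerializer

BOUNDARY = "random_string_for_multipart_content_boundary"


class MultipartSerializer(SimpleBaseSerializer):
    """Turns {'text': ..., 'image_path': ...} into a multipart/form-data body."""

    def __init__(self):
        super().__init__(content_type=f'multipart/form-data; boundary="{BOUNDARY}"')

    def serialize(self, data):
        # Read the image from a local path supplied by the caller.
        with open(data["image_path"], "rb") as f:
            image_bytes = f.read()
        body, _ = urllib3.encode_multipart_formdata({
            "text": data["text"],
            "photo": (data["image_path"], image_bytes, "image/jpeg"),
        }, boundary=BOUNDARY)
        return body
```

You could then attach serializer=MultipartSerializer() to your Predictor (or pass it to .deploy()) and call predictor.predict({"text": "...", "image_path": "..."}).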

In your serving script you could expect the content type to be multipart/form-data; boundary="random_string_for_multipart_content_boundary"; textual features can be parsed from request.data and the image file content from request.files - a sketch of this is below. Hope this helps!
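And in case you use the default PyTorch/HuggingFace inference toolkit instead of your own Flask app, the same parsing can happen in input_fn. Again just an untested sketch - requests_toolbelt would need to be added to the inference requirements.txt:

```python
import io

from PIL import Image
from requests_toolbelt.multipart import decoder


def input_fn(request_body, content_type):
    """Split a multipart/form-data request into its text and image parts."""
    if not content_type.startswith("multipart/form-data"):
        raise ValueError(f"Unsupported content type: {content_type}")

    text, image = None, None
    for part in decoder.MultipartDecoder(request_body, content_type).parts:
        disposition = part.headers[b"Content-Disposition"].decode()
        if 'name="text"' in disposition:
            text = part.text
        elif 'name="photo"' in disposition:
            image = Image.open(io.BytesIO(part.content)).convert("RGB")

    return {"text": text, "image": image}
```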