Petrus
March 8, 2022, 12:52pm
#1
Hi there AWS + HuggingFace heroes,
I am working on an exciting model inside SageMaker that should be fine-tuned on a multi-class classification task with both images and text as input. Hence, multimodal.
I cannot find an example that showcases how to deal with multimodal models in SageMaker (if there is one, please enlighten me).
How I plan to work around this is by first following a vision/text multi-class example - see below:
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab1: Finetuning HuggingFace models with Amazon SageMaker\n",
"### Multi-Class Classification with `Trainer` and `emotion` dataset"
]
},
...
And then follow a vision classification example - see below:
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Huggingface Sagemaker-sdk - Getting Started Demo\n",
"### Binary Classification with `Trainer` and `imdb` dataset"
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"1. [Introduction](#Introduction) \n",
"2. [Development Environment and Permissions](#Development-Environment-and-Permissions)\n",
" 1. [Installation](#Installation) \n",
" 2. [Development environment](#Development-environment) \n",
" 3. [Permissions](#Permissions)\n",
"3. [Processing](#Preprocessing) \n",
" 1. [Tokenization](#Tokenization) \n",
...
And finally, try to figure out how I can combine the token and visual embeddings into a multimodal model, e.g. VisualBERT.
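What I have in mind, as a rough shape-level sketch (pure PyTorch; the dimensions, projection layer, and tiny encoder below are placeholders of my own, not VisualBERT's actual architecture):

```python
import torch
import torch.nn as nn

# Placeholder sizes: batch of 2, 16 text tokens, 36 image regions,
# 768 hidden size, 2048-dim region features (e.g. from a Faster R-CNN backbone).
batch, text_len, vis_regions, hidden = 2, 16, 36, 768

text_embeds = torch.randn(batch, text_len, hidden)    # from a text embedding layer
visual_feats = torch.randn(batch, vis_regions, 2048)  # image region features

# Project visual features into the text hidden size, then concatenate
# along the sequence axis so one encoder attends over both modalities.
visual_proj = nn.Linear(2048, hidden)
fused = torch.cat([text_embeds, visual_proj(visual_feats)], dim=1)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
    num_layers=2,
)
pooled = encoder(fused).mean(dim=1)       # simple mean pooling
logits = nn.Linear(hidden, 5)(pooled)     # e.g. 5 classes
print(logits.shape)                       # torch.Size([2, 5])
```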
Would you approach this differently? Would anyone have any experience in how to combine text and images as a multimodal input in SageMaker? Please share if you have any tips, thanks.
Hey @Petrus,
sounds like a cool project! Fine-tuning a model on multimodal data (vision and text) should make a big difference compared to text alone. You can pass your data from Amazon S3 to the training job through the .fit() method when starting your training. SageMaker will load the data into /opt/ml/input/data/{train,test}/, from where you can access it during training; these folders can contain images or text.
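For example, a sketch of starting a training job with two channels (the bucket paths, role ARN, and framework versions below are placeholders you would replace with your own):

```python
from sagemaker.huggingface import HuggingFace

role = "arn:aws:iam::111122223333:role/SageMakerRole"  # placeholder execution role

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",       # requirements.txt can live here too
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.17",  # placeholder versions
    pytorch_version="1.10",
    py_version="py38",
)

# Each dict key becomes a channel: SageMaker copies the S3 prefix to
# /opt/ml/input/data/<channel> inside the training container.
huggingface_estimator.fit({
    "train": "s3://my-bucket/multimodal/train",
    "test": "s3://my-bucket/multimodal/test",
})
```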
If you need additional dependencies, you can provide a requirements.txt in your source_dir; SageMaker will then install those dependencies before running your train.py.
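Inside train.py, the channels show up as local directories; a minimal sketch (the images/ subfolder layout is just an assumed example, not something SageMaker enforces):

```python
import os
from pathlib import Path

# SageMaker exposes each .fit() channel via an SM_CHANNEL_<NAME> env var
# and mounts the data under /opt/ml/input/data/<channel>.
train_dir = Path(os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))

# Assumed layout for illustration: an images/ subfolder next to text/label files.
images = sorted(train_dir.glob("images/*.png")) if train_dir.exists() else []
print(f"found {len(images)} images in {train_dir}")
```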