SageMaker VQA Models (Donut)

I have followed this wonderful tutorial by @philschmid for using and fine-tuning the Donut model for document understanding. (Thank you so much for the tutorial!)

I am trying to reproduce this for the VQA version of Donut.

My first step was to deploy this base model on SageMaker to see if it works, but I am having some trouble. I am using the default “deploy to Amazon SageMaker” code provided in the link, except that I changed 'HF_TASK':'document-question-answering' to 'HF_TASK':'visual-question-answering'.

The endpoint did spin up successfully, but I am having trouble feeding the model both the image data and the question. In Philipp's example, an image serializer was used so that he could feed the raw bytes of the image directly into the endpoint to get a result. My issue is that my input requires both an image and some text (the question). Some things I tried:

  1. I tried just using JSON, but the PIL image is not JSON-serializable.
  2. I tried converting the PIL image to a NumPy array and then to a list, but that seems to run into a payload-size issue. (This is for a single image, not a batch.)
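To make the first attempt concrete, here is roughly what I am doing now: base64-encoding the image bytes so the whole payload becomes JSON-serializable. Whether the default visual-question-answering handler accepts a base64 string for the image, and whether the `{"inputs": {"image": ..., "question": ...}}` shape is what it expects, are exactly the things I am unsure about:

```python
import base64
import io
import json

from PIL import Image

# Small test image standing in for a real document scan.
image = Image.new("RGB", (64, 64), color="white")

# Encode the PIL image to PNG bytes, then to a base64 string so it can
# travel inside a JSON body.
buffer = io.BytesIO()
image.save(buffer, format="PNG")
image_b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")

# The payload shape below is my assumption about what the
# visual-question-answering pipeline expects, not confirmed behavior.
payload = json.dumps({
    "inputs": {
        "image": image_b64,
        "question": "What is the invoice number?",
    }
})

# Round-trip locally just to confirm the payload is valid JSON and the
# image survives the encoding; on the client side this body would go to
# the endpoint via a JSONSerializer or invoke_endpoint.
decoded = json.loads(payload)
roundtrip = Image.open(io.BytesIO(base64.b64decode(decoded["inputs"]["image"])))
print(roundtrip.size)  # -> (64, 64)
```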

On a side note, I noticed the pipeline also takes image URLs. I tried passing a URL, but got the following error: "'str' object is not callable".
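For completeness, this is the shape of the URL-based request that produced that error. The payload key names are again my assumption about what the pipeline expects, and the URL and endpoint name below are stand-ins, not the ones I actually used:

```python
import json

# Hypothetical URL-based request body; key names are my assumption about
# the visual-question-answering pipeline's expected input, and the URL is
# a stand-in for a real hosted document image.
payload = {
    "inputs": {
        "image": "https://example.com/sample-document.png",
        "question": "What is the total amount?",
    }
}
body = json.dumps(payload)

# Sent roughly like this (needs AWS credentials and a live endpoint, so
# commented out here):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="my-donut-vqa-endpoint",  # hypothetical endpoint name
#     ContentType="application/json",
#     Body=body,
# )
print(body)
```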

My first goal is to get this base model running so that I can send it both local image data and a question, and have it answer.

Any help would be greatly appreciated! Thanks