How do I use image-text-to-text models with the Hugging Face Inference API?

Hi all, I want to use Phi-3.5-vision-instruct for a multimodal task that takes the image of a page and converts the contents of the page into HTML tags, and if there is an image on the page, converts that image into text and adds an 'illustration' tag beside it.

How do I use the Hugging Face Inference API with Phi-3.5-vision-instruct so that it generates this output?
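For context, here is roughly what I expected the call to look like, as a sketch using `huggingface_hub`'s `InferenceClient` (the token, image URL, and prompt are placeholders, and this assumes the model is actually deployed on the serverless API):

```python
# Sketch only: assumes the model is served on the serverless Inference API,
# which (as the replies below explain) it isn't for custom-code models.
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # placeholder token

response = client.chat_completion(
    model="microsoft/Phi-3.5-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/page.png"}},  # placeholder
                {"type": "text",
                 "text": "Convert this page into HTML tags. If the page "
                         "contains an image, describe it in text and add "
                         "an 'illustration' tag beside it."},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```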


It’s in the documentation, but image-text-to-text is a newly implemented pipeline, so I don’t know whether it really works yet.
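If the pipeline does work, usage would look roughly like this sketch (the image URL is a placeholder; `trust_remote_code=True` is needed because the model ships custom code, and whether that plays well with the new pipeline is exactly the open question):

```python
# Sketch of the newly added "image-text-to-text" pipeline in transformers.
# Requires a recent transformers release and an explicit opt-in to the
# model's custom code via trust_remote_code=True.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
)

out = pipe(
    images="https://example.com/page.png",  # placeholder image
    text="<|image_1|>\nConvert the contents of this page into HTML tags.",
    max_new_tokens=512,
)
print(out)
```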

Hi @perceptron-743
the model microsoft/Phi-3.5-vision-instruct contains custom code.
By default, the Inference API is disabled for such models. This is done for security reasons, since you can’t know what code the model is running in the background.
Another way to see whether a model has the Inference API disabled is to go to the model page and check on the right:


If the icon shows a broken lightning bolt, it means the Inference API is disabled for that model.
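Since the hosted API won’t execute the custom code, the usual workaround is to run the model yourself. Here is a rough sketch following the pattern in the model card (it assumes a GPU, a recent transformers release, and a placeholder local file name):

```python
# Sketch: running Phi-3.5-vision-instruct locally. trust_remote_code=True
# explicitly opts in to executing the custom modeling code from the repo.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("page.png")  # placeholder file
messages = [{"role": "user",
             "content": "<|image_1|>\nConvert the contents of this page into HTML tags."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```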
To filter the Hub for models that have the Inference API enabled, you can check out the Warm and Cold filters on the models page.
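The same check can be done programmatically. This is a sketch with `huggingface_hub`; the `inference` filter on `list_models` assumes a reasonably recent release of the library:

```python
# Sketch: list image-text-to-text models whose serverless Inference API
# status is "warm" (i.e., currently deployed and ready to serve).
from huggingface_hub import HfApi

api = HfApi()
for m in api.list_models(
    pipeline_tag="image-text-to-text",
    inference="warm",
    limit=10,
):
    print(m.id)
```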


You’re right, but isn’t the server-side behavior a bit buggy?
When Inference is turned off, the input field itself doesn’t appear on the model page, which is the normal behavior for other models with inference disabled.
Moreover, the authors have not explicitly turned Inference off; the model card metadata actually appears to enable the widget:

---
license: mit
license_link: https://huggingface.co/microsoft/Phi-3.5-vision-instruct/resolve/main/LICENSE
language:
- multilingual
pipeline_tag: image-text-to-text
tags:
- nlp
- code
- vision
inference:
  parameters:
    temperature: 0.7
widget:
- messages:
  - role: user
    content: <|image_1|>Can you describe what you see in the image?
library_name: transformers
---