Inference provider for captioning (image2text model)

Hi everyone,

I’m new to Hugging Face and am trying to build a project with TypeScript and the Hugging Face Inference API. I have already written some methods that generate text with the Inference API.

What I want to do now is send an image to an inference provider (such as Replicate) and get back a description (caption) of that image (image-to-text). But I don’t know which provider and model to use for that purpose.

Could you please tell me which provider and model to use? I could not find one.

Thank you so much :slight_smile:


I don’t think any have actually been deployed…

It seems that several image-text-to-text models, which handle a similar task, have been deployed. Another option is to call a Gradio Space remotely via the Gradio API. Like this:

Thank you @John6666 for your quick reply.
What is the difference between an image-to-text model and an image-text-to-text model? :sweat_smile:

I have a problem using one of the models and providers that you listed under the image-text-to-text models.
When I call the method
client.imageToText({...})
with the provider “novita” (a provider that is listed) I get the error:
Something went wrong: InputError: Task 'image-to-text' not supported for provider 'novita'. Available tasks: conversational,text-generation,text-to-video

Do I have to call another method (and if so, which one?), or is the provider simply not suitable? :face_with_monocle:

I don’t know what a Gradio Space is, but I will try to dive into that topic as well. :slight_smile:


What is the difference between an image-to-text model and an image-text-to-text model? :sweat_smile:

Image-to-text typically uses traditional vision models (such as ViT or CLIP) and corresponds to standard captioning.
Image-text-to-text, on the other hand, takes both an image and a text prompt as input, which lets you steer the model toward more contextually relevant output. This task is commonly served by VLMs (vision-language models), i.e. multimodal models.

In your case, image-to-text is likely sufficient for your needs. :sweat_smile:
However, it’s worth noting that this functionality may not be available for deployment across all providers at this time…
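As a workaround, you can caption images through one of the deployed image-text-to-text models via the OpenAI-compatible chat completions endpoint instead of `imageToText`. Here is a hedged sketch using plain `fetch` (the model name and prompt are my assumptions; check the model page to see which providers actually serve it):

```typescript
// Caption an image by sending it to an image-text-to-text (vision-language)
// model through Hugging Face's OpenAI-compatible router endpoint.

interface ContentPart {
  type: "image_url" | "text";
  image_url?: { url: string };
  text?: string;
}

interface ChatMessage {
  role: "user";
  content: ContentPart[];
}

// Pure helper: build a single-turn "describe this image" prompt.
export function buildCaptionMessages(imageUrl: string): ChatMessage[] {
  return [
    {
      role: "user",
      content: [
        { type: "image_url", image_url: { url: imageUrl } },
        { type: "text", text: "Describe this image in one sentence." },
      ],
    },
  ];
}

// Network call (requires a valid HF access token; model is an assumption).
export async function captionImage(imageUrl: string, token: string): Promise<string> {
  const res = await fetch("https://router.huggingface.co/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      // Any deployed image-text-to-text model should work here.
      model: "Qwen/Qwen2.5-VL-7B-Instruct",
      messages: buildCaptionMessages(imageUrl),
    }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

The caption then comes back as the assistant message, so the model-specific captioning task never has to be deployed by the provider at all.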

While this isn’t strictly a bug or issue, it could be considered an inconvenience in terms of site usability. @meganariley @michellehbn