Inference provider for captioning (image2text model)

Hi everyone,

I’m new to Hugging Face and am trying to build a project with TypeScript and the Hugging Face Inference API. I have already written some methods that generate text with the Inference API.

What I want to do now is send an image to an inference provider (such as Replicate) and get back a description (caption) of that image (image-to-text). But I don’t know which provider and model to use for that purpose.

Could you please tell me which provider and model to use? I could not find one.

Thank you so much :slight_smile:


I don’t think any have actually been deployed…

It seems that several image-text-to-text models, which handle a similar task, have been deployed. Another option is to call a Gradio Space remotely via the Gradio API. Like this:

Thank you @John6666 for your quick reply.
What is the difference between an image-to-text model and an image-text-to-text model? :sweat_smile:

I have a problem using one of the models and providers that you listed under the image-text-to-text models.
When I call the method
client.imageToText({...})
with the provider “novita” (a provider that is listed) I get the error:
Something went wrong: InputError: Task 'image-to-text' not supported for provider 'novita'. Available tasks: conversational,text-generation,text-to-video

Do I have to call another method (and if so, which one?), or is the provider simply not suitable? :face_with_monocle:

I don’t know what a Gradio Space is, but I will try to dive into that topic as well. :slight_smile:


What is the difference between an image-to-text model and an image-text-to-text model? :sweat_smile:

Image-to-text typically uses traditional vision models (such as ViT or CLIP) and corresponds to standard captioning.
Image-text-to-text, on the other hand, takes both an image and a text prompt as input, which lets you steer the model toward more contextually relevant output. This task is commonly served by VLMs (vision-language models), i.e. multimodal models.

In your case, image-to-text is likely sufficient for your needs. :sweat_smile:
However, it’s worth noting that this functionality may not be available for deployment across all providers at this time…
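As a workaround, you can caption images through one of the deployed image-text-to-text models via the OpenAI-compatible chat completions endpoint instead of `imageToText`. Here is a hedged sketch using plain `fetch` (the model name and prompt are my assumptions; check the model page to see which providers actually serve it):

```typescript
// Caption an image by sending it to an image-text-to-text (vision-language)
// model through Hugging Face's OpenAI-compatible router endpoint.

interface ContentPart {
  type: "image_url" | "text";
  image_url?: { url: string };
  text?: string;
}

interface ChatMessage {
  role: "user";
  content: ContentPart[];
}

// Pure helper: build a single-turn "describe this image" prompt.
export function buildCaptionMessages(imageUrl: string): ChatMessage[] {
  return [
    {
      role: "user",
      content: [
        { type: "image_url", image_url: { url: imageUrl } },
        { type: "text", text: "Describe this image in one sentence." },
      ],
    },
  ];
}

// Network call (requires a valid HF access token; model is an assumption).
export async function captionImage(imageUrl: string, token: string): Promise<string> {
  const res = await fetch("https://router.huggingface.co/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      // Any deployed image-text-to-text model should work here.
      model: "Qwen/Qwen2.5-VL-7B-Instruct",
      messages: buildCaptionMessages(imageUrl),
    }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

The caption then comes back as the assistant message, so the model-specific captioning task never has to be deployed by the provider at all.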

While this isn’t strictly a bug or issue, it could be considered an inconvenience in terms of site usability. @meganariley @michellehbn