Issue with Salesforce/blip-image-captioning-large Endpoint: "input_ids or inputs_embeds" Error

Hello Hugging Face Community,

I am reaching out to seek your expertise regarding an issue I’m facing with the Salesforce/blip-image-captioning-large model via the Inference Endpoints.

Here’s a detailed outline of the problem:

  • Inference API Functionality: With the Inference API, everything works. I can send an image URL using json={"inputs": image_url}, and it returns the expected caption without my having to download the image.
  • Inference Endpoint Issue: The same request fails when I switch to Inference Endpoints. Whether I send the image URL directly or download the image and send it, I get the following error: {"error": "You have to specify either input_ids or inputs_embeds or encoder_embeds"}. The error persists and prevents the endpoint from generating a caption.
  • Endpoint Testing Failure: To investigate further, I used the ‘Test your endpoint!’ feature on the Hugging Face platform by dragging and dropping an image directly. This produced the same error message.
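
For reference, the two request shapes described above can be sketched roughly as follows. This is only an illustration: ENDPOINT_URL, HF_TOKEN, and both helper names are placeholders, and whether a given endpoint accepts either shape depends on how it was deployed, which is not confirmed here.

```python
import json
import urllib.request

# Placeholders (assumptions): substitute your own endpoint URL and access token.
ENDPOINT_URL = "https://example.invalid/your-endpoint"
HF_TOKEN = "hf_xxx"


def build_url_request(image_url: str) -> urllib.request.Request:
    """Build a POST whose JSON body is {"inputs": image_url}."""
    body = json.dumps({"inputs": image_url}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_bytes_request(image_bytes: bytes, mime: str = "image/jpeg") -> urllib.request.Request:
    """Build a POST whose body is the raw image, labelled with its MIME type."""
    return urllib.request.Request(
        ENDPOINT_URL,
        data=image_bytes,
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": mime,
        },
        method="POST",
    )


# Nothing is sent until the request is passed to urllib.request.urlopen(...).
```

Building the requests without sending them makes it easy to print and compare the exact headers and body that each service receives.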

The crux of the problem seems to be the request structure that the Inference Endpoint expects, which differs from that of the Inference API. The error suggests that the endpoint expects specific parameters (input_ids, inputs_embeds, or encoder_embeds) that are either not clearly documented or differ from the Inference API’s requirements.

I am looking for guidance on how to resolve this inconsistency:

  1. What is the correct way to structure the request for the Inference Endpoint when using image URLs?
  2. Is there a step I might be overlooking that would account for the difference in behavior between the Inference API and the Inference Endpoint?
  3. Has anyone successfully used the Inference Endpoint with image URLs, and if so, could you share an example request?
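
When comparing replies from the two services, it may help to separate a caption from the error payload quoted above. A minimal sketch, assuming a successful reply has the shape [{"generated_text": "..."}] (the usual output of the transformers image-to-text pipeline, not confirmed for this endpoint); extract_caption is a hypothetical helper name.

```python
import json


def extract_caption(response_body: str) -> str:
    """Return the caption from a JSON reply, or raise with the reported error."""
    payload = json.loads(response_body)

    # Error shape taken from the message quoted in this post: {"error": "..."}.
    if isinstance(payload, dict) and "error" in payload:
        raise RuntimeError(f"Endpoint returned an error: {payload['error']}")

    # Assumed success shape: a non-empty list whose first item holds the caption.
    if (
        isinstance(payload, list)
        and payload
        and isinstance(payload[0], dict)
        and "generated_text" in payload[0]
    ):
        return payload[0]["generated_text"]

    raise ValueError(f"Unexpected response shape: {payload!r}")
```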

Any insights, code snippets, or documentation references that could shed light on this issue would be incredibly valuable.

Thank you in advance for your assistance and support.

Warm regards,

i wish HF documented the expected inputs for the API and inference endpoints. i assumed they were the same schema.

did you resolve this for the Salesforce/blip model? because i’m facing the same issue…