Inference providers: Access to processor data?

I love the HF inference providers, but I've now run into a question:

Is it possible to get access to the model’s processor output as well via the API?

My specific use case is with Qwen2.5-VL: I ask the model to find bounding box coordinates for page elements on document images, and it generally does very well at this localization task.

In order to correctly map the localization data returned by the model back to my original image sizes, I found that I needed access to the processor's inputs. That's because the Qwen processor resizes images, which I believe is common for many models with vision encoders. In my case, using the transformers library:

inputs = processor(text=[text], images=images, padding=True, return_tensors="pt")
...
output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
 
# Now I can obtain the processed input image size
# (grid height/width x patch size 14):
input_height = inputs['image_grid_thw'][0][1] * 14
input_width = inputs['image_grid_thw'][0][2] * 14

The model’s localization coordinates are based on that resized image, so these dimensions are needed to scale the coordinates to whatever image dimensions the user actually sees.
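For completeness, that rescaling step can be sketched as follows. The function name and the `display_*` parameters are illustrative, not from any API:

```python
# Hypothetical helper: map a bbox from the processor's resized input
# space (input_width x input_height, as computed above) to the image
# dimensions the user actually sees.
def rescale_bbox(bbox, input_width, input_height, display_width, display_height):
    sx = display_width / input_width    # horizontal scale factor
    sy = display_height / input_height  # vertical scale factor
    x1, y1, x2, y2 = bbox
    return [x1 * sx, y1 * sy, x2 * sx, y2 * sy]

# e.g. model coords in a 560x280 processed input, displayed at 1120x560:
print(rescale_bbox([28, 14, 280, 140], 560, 280, 1120, 560))
# [56.0, 28.0, 560.0, 280.0]
```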

How could I solve this using the Inference API?


If this were a Dedicated Endpoint that you maintained yourself, you could change the return value by simply rewriting handler.py, but since you are using an Inference Provider, that part is a black box.

Therefore, as you suggested, mimicking the processing that is likely being done internally is a relatively lightweight and better approach.
With the following code, the entire model will not be downloaded; only the processor's JSON configuration files are fetched.

from PIL import Image
import requests
from transformers import AutoProcessor

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/diffusion-quicktour.png"
orig = Image.open(requests.get(url, stream=True).raw)
prompt = "describe this image"
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

inputs = processor(images=[orig], text=[prompt], padding=True, return_tensors="pt")

grid_h, grid_w = inputs["image_grid_thw"][0][1:].tolist()
proc_h, proc_w = grid_h * 14, grid_w * 14
sx, sy = orig.width / proc_w, orig.height / proc_h
print(inputs["image_grid_thw"], sx, sy) # tensor([[ 1, 18, 18]]) 1.0158730158730158 1.0158730158730158
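If you want to avoid downloading even the processor config, the resize logic can be approximated locally. This is a sketch paraphrased from the `smart_resize` helper in `qwen_vl_utils`; treat the exact rounding rules and the default pixel bounds as assumptions and verify them against the library before relying on it:

```python
import math

# Approximation of Qwen2.5-VL's image resizing: dimensions are rounded
# to multiples of `factor` (patch size 14 x spatial merge 2 = 28), and
# the total pixel count is kept within [min_pixels, max_pixels].
# The default bounds below are assumptions; check qwen_vl_utils.
def smart_resize(height, width, factor=28,
                 min_pixels=4 * 28 * 28, max_pixels=16384 * 28 * 28):
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

print(smart_resize(256, 256))  # (252, 252) -- matches 18 * 14 above
```

For the 256x256 test image above this reproduces the 252x252 size that the real processor reports via `image_grid_thw`.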
