I love the HF inference providers, but I've now run into a question:
Is it possible to also get access to the model's processor output via the API?
My specific use-case is with Qwen2.5-VL. I ask the model to perform localization tasks on document images, i.e. to return bounding box coordinates for page elements. The model generally does very well at this.
In order to correctly map the localization data returned by the model back to my original image sizes, I found that I needed access to the processor's inputs. That's because the Qwen processor resizes the input images, which I believe is common for models with vision encoders. In my case, using the transformers library:
inputs = processor(text=[text], images=images, padding=True, return_tensors="pt")
...
output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
# Strip the prompt tokens so only the newly generated tokens remain
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
# The processor reports the image grid as (t, h, w) in units of 14-pixel patches,
# so multiplying by the patch size recovers the resized input image dimensions:
input_height = inputs['image_grid_thw'][0][1] * 14
input_width = inputs['image_grid_thw'][0][2] * 14
The model's localization coordinates are expressed in that resized image's pixel space, so knowing it is essential for scaling the coordinates to whatever image dimensions the user actually sees.
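For context, the rescaling step itself is straightforward once the processor's input size is known. A minimal sketch (the `scale_bbox` helper is my own, and it assumes the model returns absolute pixel coordinates in the resized image's space):

```python
def scale_bbox(bbox, input_size, target_size):
    """Map an (x1, y1, x2, y2) box from the processor's resized image
    to the dimensions the user actually sees."""
    in_w, in_h = input_size
    out_w, out_h = target_size
    sx, sy = out_w / in_w, out_h / in_h
    x1, y1, x2, y2 = bbox
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)

# e.g. the model saw a 1092x1540 resized page, the user sees a 2480x3508 original
scaled = scale_bbox((100, 200, 300, 400), (1092, 1540), (2480, 3508))
```

The point is that both `input_width` and `input_height` come from the processor, which is exactly the piece I can't see when calling the hosted API.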
How could I solve this using the Inference API?