Support for different models in text-to-image pipeline

anakin87 · January 13, 2023, 2:39pm

Hello!

I’ve seen this great space by @nielsr: Comparing Captioning Models - a Hugging Face Space by nielsr. It includes GIT and BLIP models.

Currently, the image-to-text pipeline only supports these models, which are based on a different architecture.

Are there any plans to support GIT/BLIP models in the image-to-text pipeline?

I’m trying to build a feature in an open-source library based on this pipeline, and it would be great to switch to these more modern and performant models in the future…
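
For context, here is a minimal sketch of how the model could be kept swappable behind the pipeline. The helper names `load_captioner` and `caption_image` are hypothetical, and the default checkpoint `nlpconnect/vit-gpt2-image-captioning` is only an assumption about which currently supported model one might start from; GIT or BLIP checkpoints would slot in the same way once the pipeline accepts them.

```python
from typing import Any, Callable, Dict, List

# A captioner is anything that maps an image (path, URL, or PIL image)
# to a list of {"generated_text": ...} dicts, which is the shape the
# image-to-text pipeline returns.
Captioner = Callable[[Any], List[Dict[str, str]]]


def load_captioner(
    model_name: str = "nlpconnect/vit-gpt2-image-captioning",  # assumed default checkpoint
) -> Captioner:
    """Build an image-to-text pipeline for the given checkpoint (hypothetical helper)."""
    # Imported lazily so the rest of the module works without transformers installed.
    from transformers import pipeline

    return pipeline("image-to-text", model=model_name)


def caption_image(image: Any, captioner: Captioner) -> str:
    """Return the first generated caption for an image, stripped of whitespace."""
    results = captioner(image)
    return results[0]["generated_text"].strip()
```

With this split, switching models later would only mean passing a different `model_name` to `load_captioner`; the calling code stays the same.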

nielsr · January 13, 2023, 3:08pm

Yes, that’s definitely on the roadmap. I just opened an issue for it: Add support for BLIP and GIT in image-to-text and VQA pipelines · Issue #21110 · huggingface/transformers · GitHub
