Support for different models in the image-to-text pipeline


I’ve seen this great space by @nielsr: Comparing Captioning Models - a Hugging Face Space by nielsr. It includes the GIT and BLIP models.

Currently, the image-to-text pipeline only supports models based on a different architecture.

Are there any plans to support the GIT/BLIP models in the image-to-text pipeline?

I’m trying to build a feature in an open-source library based on this pipeline, and it would be great to be able to switch to these more modern, better-performing models in the future.

Yes, that’s definitely on the roadmap; I just opened an issue for it: Add support for BLIP and GIT in image-to-text and VQA pipelines · Issue #21110 · huggingface/transformers · GitHub
