How do I use Text-Image to Text models with Huggingface Inference?

It’s in the manual, but it’s a newly implemented pipeline, so I don’t know if it really works.