Question on text input in image captioning

I was going through this blog on image captioning.

According to the blog, the VisionEncoderDecoderModel uses this kind of architecture (shown below), where the image embeddings are fed to the decoder via the encoder-decoder (cross-)attention.
[Image: Transformer encoder-decoder architecture diagram from the blog, with "Text Caption" shown as the decoder input]

What I wanted to ask is: if the input is only an image, then what are we feeding as input (denoted by "Text Caption" in the diagram) to the Transformer decoder?
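For context, here is roughly the inference setup I am experimenting with (a minimal sketch; I am assuming the `nlpconnect/vit-gpt2-image-captioning` checkpoint, which may not be the exact model used in the blog, and the image path is just a placeholder):

```python
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# Assumption: using the nlpconnect/vit-gpt2-image-captioning checkpoint;
# the blog may use a different encoder/decoder pair.
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# "cat.jpg" is a placeholder path for any test image
image = Image.open("cat.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Only the image goes in here -- I never pass any text,
# which is exactly what my question is about.
output_ids = model.generate(pixel_values, max_length=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

As you can see, `generate()` only receives `pixel_values`, so I do not understand what takes the place of the "Text Caption" input to the decoder in this case.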