Image captioning for French with pre-trained vision and text model

excuse me does the decoder of the language model deal with words or sentences to do the captioning . i mean did you use BERT for word embedding or sentence embedding … Thanks in advance