Quite a few vision+language papers pretrain a BERT-based model on image-text data and then finetune it for the image captioning task. Yet there is no decoder involved to generate the sentences. Does that make sense? And what is the main difference between doing sentence generation with a Transformer encoder versus doing it with a Transformer decoder?
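For concreteness, here is a minimal sketch (PyTorch, toy dimensions, random untrained weights) of how I understand the encoder-only generation loop to work in those papers, roughly the UniLM/VLP-style trick: at each step a [MASK] token is appended, the whole image+text sequence is re-encoded with plain self-attention, and the word predicted at the [MASK] position becomes the next caption token, so no separate decoder with cross-attention is needed. All names and sizes below are my own illustration, and I omit the seq2seq attention mask those papers apply during finetuning, so please correct me if this picture is wrong.

```python
import torch
import torch.nn as nn

# Toy BERT-style encoder: a single stack with bidirectional self-attention
# over the concatenated (image-region + text) sequence. No cross-attention,
# no separate decoder. Vocabulary, sizes, and special-token ids are made up.
VOCAB, HIDDEN, MASK_ID, SEP_ID = 1000, 64, 1, 2

encoder_layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
token_emb = nn.Embedding(VOCAB, HIDDEN)
lm_head = nn.Linear(HIDDEN, VOCAB)

def generate_with_encoder(image_feats, max_len=10):
    """Encoder-only caption generation: append a [MASK] slot, re-encode the
    full image+text sequence, and read the word predicted at that slot.
    The encoder itself is reused autoregressively in place of a decoder."""
    text_ids = []
    for _ in range(max_len):
        ids = torch.tensor(text_ids + [MASK_ID]).unsqueeze(0)  # caption so far + [MASK]
        text_emb = token_emb(ids)
        seq = torch.cat([image_feats, text_emb], dim=1)        # image regions treated as extra "tokens"
        hidden = encoder(seq)                                  # plain self-attention over everything
        next_id = lm_head(hidden[:, -1]).argmax(-1).item()     # prediction at the [MASK] position
        if next_id == SEP_ID:                                  # stop token ends the caption
            break
        text_ids.append(next_id)
    return text_ids

# Image regions assumed to be already projected to the encoder's hidden size.
image_feats = torch.randn(1, 5, HIDDEN)
print(generate_with_encoder(image_feats))  # gibberish here, since the weights are untrained
```

If that is indeed what is going on, then the encoder-only setup just folds the decoder's job into the same stack (conditioning on the image via concatenation and self-attention), whereas a real Transformer decoder would condition on the image features through cross-attention and use causal masking by construction. Is that the right way to think about the difference?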