How to use an image tensor for caption generation with Transformer-XL or BERT?

I am fairly new to transformers and deep learning in general, so please be kind.

I am currently working on a project that will caption images using either Transformer-XL or BERT. However, I am not sure how to pass the [608, 608, 3] image tensor from my CNN to the transformer model for text generation. Can anyone help?

Please feel free to ask questions; I would be glad to clarify anything I can.


Guess I’m late. Although I’m not an expert, I can give you some ideas. You can use a network like ResNet or DenseNet to ‘encode’ the image into a 1-D feature tensor, and then feed that tensor to a transformer decoder to generate the captions.
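
Here is a minimal sketch of that idea in PyTorch/torchvision, assuming a ResNet-50 encoder and a plain `nn.TransformerDecoder` standing in for the caption generator (not Transformer-XL or BERT specifically). The vocabulary size, hidden size, and layer counts are placeholder values for illustration only.

```python
# Sketch only: CNN image encoder -> 1-D feature -> transformer decoder for captions.
# Hyperparameters (vocab_size, d_model, num_layers, nhead) are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class ImageCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, num_layers=3, nhead=8):
        super().__init__()
        # CNN encoder: ResNet-50 with its classification head removed,
        # so it produces a 2048-dim feature vector per image.
        resnet = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        self.proj = nn.Linear(2048, d_model)  # project to the decoder's hidden size

        # Transformer decoder that attends to the image feature as its "memory".
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: [batch, 3, H, W] (channels-first)
        # captions: [seq_len, batch] token ids (teacher forcing during training)
        feats = self.cnn(images)                 # [batch, 2048, 1, 1]
        feats = feats.flatten(1)                 # [batch, 2048]  <- the 1-D tensor
        memory = self.proj(feats).unsqueeze(0)   # [1, batch, d_model]

        tgt = self.embed(captions)               # [seq_len, batch, d_model]
        mask = nn.Transformer.generate_square_subsequent_mask(captions.size(0))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                 # [seq_len, batch, vocab_size]


# Usage: the [608, 608, 3] tensor from the question needs to be permuted to
# channels-first and given a batch dimension before it goes into the CNN.
model = ImageCaptioner()
image = torch.rand(608, 608, 3).permute(2, 0, 1).unsqueeze(0)  # [1, 3, 608, 608]
captions = torch.randint(0, 10000, (12, 1))                    # dummy token ids
logits = model(image, captions)
print(logits.shape)  # torch.Size([12, 1, 10000])
```

At inference time you would decode autoregressively (start token, then feed the generated tokens back in), and in practice you would likely use a pretrained CNN and add positional encodings to the caption embeddings; those details are omitted here to keep the sketch short.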