I am fairly new to transformers and deep learning in general, so please be kind.
I am currently working on a project that captions images using either Transformer-XL or BERT. However, I am not sure how to pass the image tensor of shape [608, 608, 3] from my CNN to the transformer model for text generation. Can anyone help?
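To make the question concrete, here is a minimal sketch of what I think the missing step might look like. It is only my guess, not a working pipeline: I am assuming the transformer wants a sequence of vectors rather than a 3-D tensor, so the sketch cuts the [608, 608, 3] array into patches and projects each one to the model width. The helper `image_to_sequence`, the patch size of 16, `d_model = 512`, and the random projection matrix are all hypothetical placeholders (a real model would learn the projection), and I am using NumPy just to show the shapes.

```python
import numpy as np


def image_to_sequence(image, patch=16):
    """Split an (H, W, C) array into a (num_patches, patch*patch*C) sequence.

    Hypothetical helper: the patch size is an assumption, not something
    Transformer-XL or BERT prescribes.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "patch size must divide H and W"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    # Flatten the grid into a sequence and each patch into one vector.
    return x.reshape(gh * gw, patch * patch * c)


rng = np.random.default_rng(0)

# Random stand-in for the [608, 608, 3] tensor from the question.
image = rng.random((608, 608, 3), dtype=np.float32)

# 608 / 16 = 38 patches per side, so 38 * 38 = 1444 "tokens" of size 768.
tokens = image_to_sequence(image)

# Assumed transformer width; a trained model would learn this projection.
d_model = 512
projection = rng.standard_normal((tokens.shape[1], d_model), dtype=np.float32) * 0.02
embedded = tokens @ projection

print(tokens.shape, embedded.shape)
```

If this is roughly the right idea, my remaining question is how a text model such as Transformer-XL or BERT would consume the resulting `embedded` sequence when generating the caption.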
Please feel free to ask questions; I would be glad to provide any further details I can.