Hey @dmatos2012, don’t worry about experience. We always try to make things easier for everyone, and we have a super cool speaker lineup for getting familiar with JAX/Flax/Transformers. And we will try to answer all questions :)
@mrm8488 For image captioning it’ll be more like an encoder-decoder model. The encoder will be an image model, and the decoder can be any transformer model with cross-attention, which will take the hidden_states from the image model and generate text auto-regressively.
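A minimal sketch of that setup (not an official recipe, just one way to wire it up), assuming a transformers version that ships `FlaxVisionEncoderDecoderModel`; the ViT encoder and GPT-2 decoder checkpoints are only example choices:

```python
import requests
from PIL import Image
from transformers import (
    AutoTokenizer,
    FlaxVisionEncoderDecoderModel,
    ViTFeatureExtractor,
)

# Pair a ViT image encoder with a GPT-2 decoder. The decoder's cross-attention
# layers are newly initialized, so the combined model needs fine-tuning on a
# captioning dataset (e.g. COCO) before the captions are meaningful.
model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no pad token, so reuse EOS; the decoder starts generating from BOS.
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.eos_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id

# Encode an example image into pixel_values; the encoder turns these into
# hidden_states that the decoder attends to via cross-attention.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = feature_extractor(images=image, return_tensors="np").pixel_values

# Captions are then generated auto-regressively, conditioned on the image.
output_ids = model.generate(pixel_values, max_length=16, num_beams=4).sequences
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```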