We’re planning to use a ViT encoder and an mBART decoder, trained end-to-end, for image captioning in multiple languages.
Since this is a seq2seq model, the run_summarization_flax.py script can be adapted for training it with some modifications.
Our end use case is to run the model on video clips or movies, turning it into an accessibility tool for visually impaired users. This can be showcased with a Streamlit or Gradio app.
We’re also planning to benchmark its zero-shot performance on VizWiz.
- Training an encoder-decoder model end-to-end in Flax/JAX will require some effort
- Data processing in case new datasets are available only in English
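For the English-only case, one option is an offline preprocessing pass that machine-translates each caption into the target languages before training. The sketch below shows only the pipeline shape: `expand_captions` and the `translate` callable are hypothetical names, and the stub translator stands in for a real MT model (e.g. mBART-50 or MarianMT).

```python
# Sketch of an offline step that expands English-only (image, caption)
# records into one record per target language. `translate` is a placeholder
# for a real machine-translation system; here it is stubbed.
from typing import Callable

def expand_captions(records, langs, translate: Callable[[str, str], str]):
    """Return one record per (image, target language) pair."""
    out = []
    for rec in records:
        for lang in langs:
            out.append({
                "image": rec["image"],
                "lang": lang,
                "caption": translate(rec["caption"], lang),
            })
    return out

# Stub translator, for illustration only.
stub = lambda text, lang: f"[{lang}] {text}"
data = [{"image": "img_0.jpg", "caption": "a dog on a beach"}]
rows = expand_captions(data, ["fr", "de"], stub)
```

Doing this once offline keeps the training loop identical to the monolingual case; the language tag per record can later drive mBART's target-language token.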