Image Captioning - ViT + BERT with WIT

dmatos2012 · July 7, 2021, 9:09am

Hi,

I am fairly new to NLP, so I'm still getting used to HF and NLP in general. My project is image captioning in Spanish, so I have the following questions:

It looks like the WIT dataset isn't in the HF library, so I downloaded it onto the TPU. But the files are TSV, and the HF parsers I found support CSV, Python dicts, JSON, etc. What would be the fastest way to parse them correctly? Is there any HF library for TSVs?
Valhalla, my project lead, suggested using the run_summarization_flax.py script, but I am a bit confused. How would I go about specifying ViT as the encoder, and how would I go about adding my cross-attention layer to BERT and passing it to that script? I see only one model_dir and tokenizer_name, and I am not sure how to set them for my use case.

I just want to start with pre-trained networks and, once that works, move on to fine-tuning etc. Thanks

valhalla · July 7, 2021, 9:22am

Hi there,

Regarding the dataset:
Yes, the dataset is not available in the datasets lib. You will need to download the TSV and then prepare the dataset by downloading the images. The TSV file for WIT contains the image URLs and other metadata.
This script might help. It’s for downloading Conceptual Captions data, but you could repurpose it to download WIT.
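
As a rough illustration of the TSV step, here is a minimal sketch that uses only Python's standard csv module to pull (image URL, caption) pairs for one language out of a WIT-style file. The column names `language`, `image_url` and `caption_reference_description` are assumed from the public WIT schema and should be checked against the header of the downloaded files; `iter_captions` is a hypothetical helper name, and the in-memory sample stands in for a real shard.

```python
import csv
import io


def iter_captions(tsv_file, language="es"):
    """Yield (image_url, caption) pairs for one language from a WIT-style TSV.

    Column names are assumptions based on the public WIT schema.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        # Missing or empty captions are skipped.
        caption = (row.get("caption_reference_description") or "").strip()
        if row.get("language") == language and caption:
            yield row["image_url"], caption


# Tiny in-memory stand-in for a real WIT shard (illustrative data only).
sample = io.StringIO(
    "language\timage_url\tcaption_reference_description\n"
    "es\thttps://example.org/a.jpg\tUn gato negro\n"
    "en\thttps://example.org/b.jpg\tA black cat\n"
    "es\thttps://example.org/c.jpg\t\n"
)
pairs = list(iter_captions(sample))
print(pairs)
```

For a real file on disk, the same function can be given a handle from `open(path, encoding="utf-8", newline="")`, or from `gzip.open(path, "rt", encoding="utf-8", newline="")` if the shard is gzip-compressed. The images themselves still have to be fetched from the URLs in a separate step.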

Regarding the model:
There is no off-the-shelf model for this in transformers (yet!). What we need here is a Seq2Seq model where the encoder is a pre-trained vision model like ViT or CLIP's vision model and the decoder is any pre-trained text model (BERT/RoBERTa).

To do this we will need to modify the BERT/RoBERTa model and add a cross-attention layer to it.
The encoder will then encode the images, and the decoder will take the decoder_input_ids and the encoder_hidden_states.

This should be similar to the encoder-decoder model in PyTorch: transformers/modeling_encoder_decoder.py at master · huggingface/transformers · GitHub.
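
To make the data flow concrete, here is a toy sketch of what a cross-attention layer computes: each decoder (text) hidden state acts as a query over the encoder (image patch) hidden states. It is plain Python, single-head, and deliberately omits the learned query/key/value projections, so it illustrates the mechanism only and is not the transformers implementation. The function name and the example vectors are made up for illustration.

```python
import math


def cross_attention(decoder_states, encoder_states):
    """Single-head scaled dot-product cross-attention, no learned projections.

    decoder_states: list of T query vectors (one per text token)
    encoder_states: list of S key/value vectors (one per image patch)
    Returns (outputs, weights): T mixed vectors and the T x S attention weights.
    """
    dim = len(encoder_states[0])
    outputs, weights = [], []
    for query in decoder_states:
        # Similarity between this text token and every image patch.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in encoder_states
        ]
        # Numerically stable softmax over the patches.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Weighted average of the patch vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(row, encoder_states))
             for i in range(dim)]
        )
    return outputs, weights


image_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # S = 3 encoder_hidden_states
text_tokens = [[2.0, 0.0], [0.0, 2.0]]                # T = 2 decoder hidden states
outputs, weights = cross_attention(text_tokens, image_patches)
```

The output has one vector per text token, each a mixture of the image patch vectors. In a real decoder this sits between the self-attention and feed-forward blocks of every layer, and the encoder and decoder hidden sizes must match or be bridged by a projection.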

@gchhablani is also working on a similar project and might have the modeling code ready.

Let me know if you have any more questions.

johnrodriguez190380 · October 21, 2021, 7:19am

Any progress on this topic?
