Image Captioning - ViT + BERT with WIT


I'm fairly new to NLP and still getting used to HF and NLP in general. My project is image captioning in Spanish, which leads to the following questions:

  1. It looks like the WIT dataset isn't in the HF library, so I downloaded it onto the TPU. But the files are TSV, and the HF parsers I found support CSV, Python dicts, JSON, etc. What would be the fastest way to parse them correctly? Is there any HF library for TSVs?

  2. Valhalla, my project lead, suggested using a script, but I am a bit confused. How would I go about specifying ViT as the encoder, and how would I add my cross-attention layer to BERT and specify it to that script? I see only one model_dir and tokenizer_name, and I'm not sure how they apply to my use case.

I just want to start with pre-trained networks and, once that works, move on to fine-tuning etc. Thanks

Hi there,

Regarding the dataset:
Yes, the dataset is not available in the datasets lib. You will need to download the TSV and then prepare the dataset by downloading the images. The TSV file for WIT contains the image URLs and other metadata.
This script might help. It’s for downloading conceptual captions data, but you could re-purpose it to download WIT.
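
As a rough illustration of the parsing step, here is a minimal sketch that reads a WIT-style TSV with Python's standard csv module and keeps only the Spanish rows. The column names (language, image_url, caption_reference_description) are my assumption about the WIT header, so check them against the first line of your file. The helper name iter_wit_rows and the sample rows are made up for the example.

```python
import csv
import io


def iter_wit_rows(tsv_file, language="es"):
    """Yield (image_url, caption) pairs for one language from a WIT-style TSV.

    Column names are assumed; adjust them to the header of your file.
    """
    # QUOTE_NONE treats quote characters as ordinary text inside fields.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row.get("language") != language:
            continue
        caption = (row.get("caption_reference_description") or "").strip()
        image_url = (row.get("image_url") or "").strip()
        # Skip rows that lack either an image URL or a caption.
        if caption and image_url:
            yield image_url, caption


# Tiny in-memory sample standing in for a real WIT shard (made-up rows).
sample = io.StringIO(
    "language\timage_url\tcaption_reference_description\n"
    "es\thttps://example.org/a.jpg\tUn gato en la ventana\n"
    "en\thttps://example.org/b.jpg\tA dog on the beach\n"
    "es\thttps://example.org/c.jpg\t\n"
)
pairs = list(iter_wit_rows(sample))
print(pairs)
```

For a real shard you would pass an open file handle instead of the StringIO, for example one from gzip.open(path, "rt", encoding="utf-8", newline="") if the file is gzipped. The resulting URL/caption pairs can then be fed to whatever image download script you end up using.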

Regarding model:
There is no off-the-shelf model for this in transformers (yet!). What we need here is a Seq2Seq model where the image encoder is a pre-trained vision model like ViT or CLIP's vision model, and the decoder is any pre-trained text model (BERT/RoBERTa).

To do this, we will need to modify the BERT/RoBERTa model and add a cross-attention layer to it.
The encoder will then encode the images, and the decoder will take the decoder_input_ids and the encoder_hidden_states.
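
To make the data flow concrete, here is a dependency-free sketch of what a cross-attention layer computes: each decoder position (caption token) forms a query and takes a softmax-weighted average over the encoder_hidden_states (image patches). This is only an illustration under simplifying assumptions: single head, no learned query/key/value projections, no masking, and toy vectors. A real layer would use the framework's attention modules with trained weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_states):
    """Single-head cross-attention without learned projections.

    decoder_states: list of T vectors, used as queries.
    encoder_states: list of S vectors, used as both keys and values.
    Returns T vectors, each a weighted average of the encoder vectors.
    """
    dim = len(encoder_states[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in decoder_states:
        # Scaled dot-product score of this query against every encoder vector.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in encoder_states]
        weights = softmax(scores)
        # Mix the encoder vectors according to the attention weights.
        mixed = [sum(w * v[d] for w, v in zip(weights, encoder_states)) for d in range(dim)]
        outputs.append(mixed)
    return outputs


# Toy values: 3 "image patch" vectors from the encoder, 2 "caption token" states.
encoder_hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_hidden_states = [[10.0, 0.0], [0.0, 0.0]]
out = cross_attention(decoder_hidden_states, encoder_hidden_states)
print(out)
```

The output has one vector per decoder position, which is why the decoder needs both its own decoder_input_ids and the encoder_hidden_states: the former produce the queries, the latter supply the keys and values.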

This should be similar to the encoder-decoder model in PyTorch: transformers/ at master · huggingface/transformers · GitHub.

@gchhablani is also working on a similar project and might have the modeling code ready :slight_smile:

Let me know if you have any more questions.


Any progress on this topic?