How can I make an Img2Text transformer using the existing modules?

coldfir3 · May 27, 2021, 9:42pm

I am trying to build a captioning system that takes an image as input and outputs a caption.

This is an attempt at the Kaggle competition Bristol-Myers Squibb – Molecular Translation.

To do that, I tried to build a BERT encoder-decoder model with ViT embeddings using the following code:

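import torch
from torch import nn
from transformers import (
    BertGenerationConfig,
    EncoderDecoderConfig,
    EncoderDecoderModel,
    ViTConfig,
    ViTModel,
)

class ViTBert(nn.Module):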

  def __init__(self, vocab_size):
    super().__init__()
    config_encoder = BertGenerationConfig(
            vocab_size = vocab_size,
            hidden_size = 256,
            num_hidden_layers = 4,
            num_attention_heads = 4,
            intermediate_size = 1024,
            bos_token_id = 0,
            eos_token_id = 2
    )
    config_decoder = BertGenerationConfig(
            vocab_size = vocab_size,
            hidden_size = 256,
            num_hidden_layers = 4,
            num_attention_heads = 4,
            intermediate_size = 1024,
            add_cross_attention=True, 
            is_decoder=True,
            bos_token_id = 0,
            eos_token_id = 2
    )
    config_encdec = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
    config_emb = ViTConfig(
        hidden_size = 256,
        num_hidden_layers = 4,
        num_attention_heads = 4,
        intermediate_size = 1024,
        image_size = 224,
        patch_size = 16,
        num_channels = 1
    )
    self.emb = ViTModel(config_emb).embeddings
    self.model = EncoderDecoderModel(config_encdec)

  def forward(self, x, y = None, l = None):

    return self.model(
        inputs_embeds = self.emb(x),
        decoder_input_ids = y,
        labels = l
        )     

  def beam_search(self, x, y, beam_scorer, criteria):
    lhs = self.model.encoder(inputs_embeds = self.emb(x)).last_hidden_state     

    return self.model.decoder.beam_search(input_ids = y,
                                   encoder_hidden_states = lhs,
                                   beam_scorer=beam_scorer,
                                   stopping_criteria = criteria)

model = ViTBert(len(vocab))

The first thing I did was check that it runs on random inputs. Here I mimic a batch of 3 images of size [1, 224, 224], with text inputs and labels of shape [batch, seq_len] holding integers from 0 to 40 (my vocab length is 41).

x = torch.rand((3, 1, 224, 224))
y = torch.randint(0, 41, (3, 50))
l = torch.randint(0, 41, (3, 50))
model(x, y, l).keys()

odict_keys(['loss', 'logits', 'encoder_last_hidden_state'])
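For reference, the sequence length the encoder sees can be worked out from the ViTConfig values above. This is a minimal sketch, assuming the ViT embeddings prepend a single [CLS] token to the patch tokens; the helper name `vit_sequence_length` is made up for illustration.

```python
def vit_sequence_length(image_size: int, patch_size: int, cls_tokens: int = 1) -> int:
    """Number of tokens produced from one square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus the assumed [CLS] token(s).
    return patches_per_side ** 2 + cls_tokens

seq_len = vit_sequence_length(image_size=224, patch_size=16)
print(seq_len)  # 14 * 14 patches + 1 [CLS] token = 197
```

Under that assumption, self.emb(x) for the batch of 3 images above would have shape [3, 197, 256], which is what the encoder receives as inputs_embeds.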

I trained it for a few epochs using the 'loss' returned by the model, but when I run predictions I get the same output for different images. I did some testing, and although the logits differ slightly between input images, their argmax is always the same.

x = torch.rand((2, 1, 224, 224))
y = torch.cat([torch.randint(0, 41, (1, 50))]*2)
preds = model(x, y).logits.argmax(dim=2)
all(preds[0] == preds[1])

True
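To put a number on "slightly different", the check above can be extended to report how far apart the logits are and what fraction of positions share an argmax. This is only a sketch on stand-in tensors, not real model output: `compare_logits` is a hypothetical helper, and the [seq_len, vocab] shapes simply mirror the example above.

```python
import torch

def compare_logits(a: torch.Tensor, b: torch.Tensor):
    """Return (max absolute logit difference, fraction of positions with equal argmax).

    Both tensors are expected to have shape [seq_len, vocab_size].
    """
    max_diff = (a - b).abs().max().item()
    same_argmax = (a.argmax(dim=-1) == b.argmax(dim=-1)).float().mean().item()
    return max_diff, same_argmax

torch.manual_seed(0)
# Stand-in for a decoder whose output is dominated by the text input:
# one clearly dominant token per position, shared by both "images" ...
shared = torch.zeros(50, 41)
shared[torch.arange(50), torch.arange(50) % 41] = 5.0
# ... plus a small perturbation that plays the role of the image-dependent part.
logits_img0 = shared + 1e-3 * torch.rand(50, 41)
logits_img1 = shared + 1e-3 * torch.rand(50, 41)

max_diff, same = compare_logits(logits_img0, logits_img1)
print(max_diff, same)
```

On the real model the same helper could be called as compare_logits(out.logits[0], out.logits[1]). A tiny max difference combined with a matching fraction of 1.0 would suggest the image contributes very little to the decoder output.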

Not surprisingly, when I run beam_search I always get the same result:

from transformers import BeamSearchScorer, MaxLengthCriteria, StoppingCriteriaList

beam_scorer = BeamSearchScorer(
    batch_size=2,
    num_beams=6,
    device=model.model.decoder.device
)
criteria = StoppingCriteriaList([MaxLengthCriteria(100)])
res = model.beam_search(
     torch.randn((2, 1, 224, 224)),
     torch.cat([torch.tensor([[0]])]*12), 
     beam_scorer, criteria)
all(res[0] == res[1])

True
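The 12 rows of start tokens above correspond to batch_size * num_beams = 2 * 6. The following is a minimal sketch of that shape bookkeeping, under the assumption that the decoder needs one copy of the encoder states per beam. The tensors are random stand-ins for lhs, and the sequence length of 197 is itself an assumption (14 * 14 patches plus one [CLS] token).

```python
import torch

batch_size, num_beams = 2, 6
seq_len, hidden_size = 197, 256  # assumed stand-in shape for the encoder output `lhs`

lhs = torch.randn(batch_size, seq_len, hidden_size)

# One start token (id 0) per beam: shape [batch_size * num_beams, 1].
start_ids = torch.zeros((batch_size * num_beams, 1), dtype=torch.long)

# Repeat each image's encoder states num_beams times, so that rows
# 0..5 belong to image 0 and rows 6..11 belong to image 1.
lhs_expanded = lhs.repeat_interleave(num_beams, dim=0)

print(start_ids.shape, lhs_expanded.shape)
```

Whether beam_search actually forwards encoder_hidden_states to the decoder is a separate point that may be worth verifying, since identical results for different images would also follow if the encoder states were never used.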

I just started learning about transformers a few weeks ago and my knowledge is very shallow, so any advice would be helpful (even if not directly related to this problem).

If you are interested, my Colab notebook can be accessed here: Google Colab
The notebook is a bit messy, but I enabled comments on it, so if you feel like writing anything down, feel free to do so =)

Looking forward to any comments.

Sincerely,
Passos.

johnrodriguez190380 · October 21, 2021, 7:39am

Any luck with this task?
