How can I make an Img2Text transformer using the existing modules?

I am trying to build a captioning system that inputs an image and outputs a caption.

This is an attempt to solve the Kaggle competition: Bristol-Myers Squibb – Molecular Translation | Kaggle.

For that I tried to build a Bert encoder-decoder module with ViT Embeddings using the following code:

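import torch
from torch import nn
from transformers import (
    BeamSearchScorer,
    BertGenerationConfig,
    EncoderDecoderConfig,
    EncoderDecoderModel,
    MaxLengthCriteria,
    StoppingCriteriaList,
    ViTConfig,
    ViTModel,
)

class ViTBert(nn.Module):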

  def __init__(self, vocab_size):
    super().__init__()
    config_encoder = BertGenerationConfig(
            vocab_size = vocab_size,
            hidden_size = 256,
            num_hidden_layers = 4,
            num_attention_heads = 4,
            intermediate_size = 1024,
            bos_token_id = 0,
            eos_token_id = 2
    )
    config_decoder = BertGenerationConfig(
            vocab_size = vocab_size,
            hidden_size = 256,
            num_hidden_layers = 4,
            num_attention_heads = 4,
            intermediate_size = 1024,
            add_cross_attention=True, 
            is_decoder=True,
            bos_token_id = 0,
            eos_token_id = 2
    )
    config_encdec = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
    config_emb = ViTConfig(
        hidden_size = 256,
        num_hidden_layers = 4,
        num_attention_heads = 4,
        intermediate_size = 1024,
        image_size = 224,
        patch_size = 16,
        num_channels = 1
    )
    self.emb = ViTModel(config_emb).embeddings
    self.model = EncoderDecoderModel(config_encdec)

  def forward(self, x, y = None, l = None):

    return self.model(
        inputs_embeds = self.emb(x),
        decoder_input_ids = y,
        labels = l
        )     

  def beam_search(self, x, y, beam_scorer, criteria):
    lhs = self.model.encoder(inputs_embeds = self.emb(x)).last_hidden_state     

    return self.model.decoder.beam_search(input_ids = y,
                                   encoder_hidden_states = lhs,
                                   beam_scorer=beam_scorer,
                                   stopping_criteria = criteria)

model = ViTBert(len(vocab))

The first thing I did was check that it runs on random inputs. Here I mimic a batch of 3 images of shape [1, 224, 224], with text inputs and labels of shape [batch, seq_len] holding integers from 0 to 40 (my vocabulary has 41 tokens).

x = torch.rand((3, 1, 224, 224))
y = torch.randint(0, 41, (3, 50))
l = torch.randint(0, 41, (3, 50))
model(x, y, l).keys()

odict_keys(['loss', 'logits', 'encoder_last_hidden_state'])
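
As a side check on the shapes involved, the sequence length that the ViT embeddings hand to the encoder can be worked out from the config values above. This is only a sketch: `vit_sequence_length` is a hypothetical helper, not part of transformers, and it assumes the embeddings prepend one [CLS] token to the patch tokens.

```python
def vit_sequence_length(image_size: int, patch_size: int, use_cls_token: bool = True) -> int:
    """Number of tokens a ViT-style patch embedding emits per image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size   # 224 // 16 = 14
    num_patches = patches_per_side ** 2           # 14 * 14 = 196
    # One extra position for the [CLS] token (assumption stated above).
    return num_patches + (1 if use_cls_token else 0)


seq_len = vit_sequence_length(image_size=224, patch_size=16)
print(seq_len)  # 197
```

With the settings in this post, `self.emb(x)` for a batch of 3 images should therefore have shape [3, 197, 256].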

I trained it for a few epochs using the 'loss' returned by the model, but when I run predictions I get the same output for different images. I did some testing: although the logits differ slightly between input images, their argmax is always the same.

x = torch.rand((2, 1, 224, 224))
y = torch.cat([torch.randint(0, 41, (1, 50))]*2)
preds= model(x, y).logits.argmax(dim=2)
all(preds[0] == preds[1])

True
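
To put a number on "slightly different", a small helper can report both the argmax agreement and the largest logit gap between two outputs. This is a sketch for narrowing things down, not a diagnosis: `compare_outputs` is a hypothetical name, and the toy tensors only stand in for `model(x, y).logits[0]` and `model(x, y).logits[1]`. The same helper could be pointed at `encoder_last_hidden_state` to see whether the two images already look alike before they reach the decoder.

```python
import torch


def compare_outputs(a: torch.Tensor, b: torch.Tensor) -> dict:
    """Summarise how much two [seq_len, vocab] tensors differ."""
    # Fraction of positions where both tensors pick the same token.
    agreement = (a.argmax(dim=-1) == b.argmax(dim=-1)).float().mean().item()
    # Largest element-wise gap between the two tensors.
    max_abs_diff = (a - b).abs().max().item()
    return {"argmax_agreement": agreement, "max_abs_diff": max_abs_diff}


# Toy stand-ins: two positions, two vocabulary entries.
logits_a = torch.tensor([[2.0, 1.0], [0.0, 3.0]])
logits_b = logits_a + torch.tensor([[0.1, -0.1], [0.05, 0.0]])

report = compare_outputs(logits_a, logits_b)
print(report)
```

In the toy case the logits differ by up to about 0.1, yet every position keeps the same argmax, which is the pattern described above.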

Not surprisingly, when I run beam_search I always get the same result:

beam_scorer = BeamSearchScorer(
    batch_size=2,
    num_beams=6,
    device=model.model.decoder.device
)
criteria = StoppingCriteriaList([MaxLengthCriteria(100)])
res = model.beam_search(
     torch.randn((2, 1, 224, 224)),
     torch.cat([torch.tensor([[0]])]*12), 
     beam_scorer, criteria)
all(res[0] == res[1])

True
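
One thing that may be worth checking here: the 12 start tokens correspond to batch_size * num_beams = 2 * 6 rows, while `lhs` in `beam_search` above still has a leading dimension of 2. The documented examples for the low-level beam search expand the encoder inputs with `repeat_interleave` so that both match. A minimal sketch of that expansion follows; `expand_for_beams` is a hypothetical helper, and the random tensor only stands in for the encoder's `last_hidden_state`.

```python
import torch


def expand_for_beams(encoder_hidden_states: torch.Tensor, num_beams: int) -> torch.Tensor:
    """Repeat each batch item num_beams times: [B, S, H] -> [B * num_beams, S, H]."""
    return encoder_hidden_states.repeat_interleave(num_beams, dim=0)


batch_size, num_beams, seq_len, hidden = 2, 6, 197, 256
lhs = torch.randn(batch_size, seq_len, hidden)  # stand-in for last_hidden_state
expanded = expand_for_beams(lhs, num_beams)

# One BOS token (id 0, as in the post) per beam of every image.
start_ids = torch.zeros((batch_size * num_beams, 1), dtype=torch.long)

print(expanded.shape)   # torch.Size([12, 197, 256])
print(start_ids.shape)  # torch.Size([12, 1])
```

Rows 0 to 5 of `expanded` all hold the first image and rows 6 to 11 the second, which lines up with how the 12 rows of start tokens are grouped.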

I just started learning transformers a few weeks ago and my knowledge is very shallow, so any advice would be helpful (even if not directly related to this problem).

If you are interested, my Colab notebook can be accessed here: Google Colaboratory
The notebook is a bit messy, but I enabled comments on it, so if you feel like writing anything down, feel free to do so =)

Looking forward to any comment.

Sincerely,
Passos.

Any luck with this task?