@philschmid thank you, I took your advice and ran some plain experiments in the Notebook.
However, I am having trouble putting my multimodal text and image inputs together so that they are accepted by the model. For now I am running a small-scale test on ten data points to get the input preparation right.
This is what I have done:
- Generated tokens of my text:
input_ids, token_type_ids, attention_mask
I used the “bert-base-uncased” tokenizer for this.
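For reference, the tokenization step looks roughly like this (a sketch; texts stands in for my actual list of ten strings):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# "texts" is a placeholder for my list of ten input strings.
encoding = tokenizer(texts, padding="max_length", truncation=True, return_tensors="pt")

input_ids = encoding["input_ids"]
token_type_ids = encoding["token_type_ids"]
attention_mask = encoding["attention_mask"]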
- Generated visual embeddings of my images:
visual_embeds, visual_token_type_ids, visual_attention_mask
I followed this example and adapted it to my data to generate the visual embeddings (using the detectron2 library): Google Colab
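The tensors I end up with have roughly these shapes (dummy values here; 36 boxes and a 1024-dim feature are just assumptions, the real numbers depend on the detectron2 backbone):

import torch

num_images, num_boxes, feat_dim = 10, 36, 1024  # feat_dim depends on the detectron2 backbone
visual_embeds = torch.randn(num_images, num_boxes, feat_dim)
visual_token_type_ids = torch.ones(num_images, num_boxes, dtype=torch.long)
visual_attention_mask = torch.ones(num_images, num_boxes, dtype=torch.float)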
- I put all six of these tensors and the labels (also a tensor) into a dictionary and converted it into a Dataset, which I then split into a train and a test Dataset.
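Concretely, the conversion looks roughly like this (a sketch; labels is my tensor of answer labels, the other tensors come from the steps above):

from datasets import Dataset

data = {
    "labels": labels,
    "input_ids": input_ids,
    "token_type_ids": token_type_ids,
    "attention_mask": attention_mask,
    "visual_embeds": visual_embeds,
    "visual_token_type_ids": visual_token_type_ids,
    "visual_attention_mask": visual_attention_mask,
}
# Convert tensors to lists for from_dict, then restore torch tensors on indexing.
dataset = Dataset.from_dict({k: v.tolist() for k, v in data.items()})
splits = dataset.train_test_split(test_size=0.2)  # 10 rows -> 8 train / 2 test
train = splits["train"].with_format("torch")
test = splits["test"].with_format("torch")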
Here is what my training dataset looks like:
Dataset({
features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask', 'visual_embeds', 'visual_token_type_ids', 'visual_attention_mask'],
num_rows: 8
})
- I then perform training as follows:
from transformers import BertTokenizer, VisualBertForMultipleChoice, TrainingArguments, Trainer
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForMultipleChoice.from_pretrained("uclanlp/visualbert-vcr")

# Default hyperparameters; only the output directory is set explicitly.
training_args = TrainingArguments(
    output_dir="output_dir/",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=test,
)

trainer.train()
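I left TrainingArguments at its defaults apart from output_dir, which (as the log below shows) gives 3 epochs and a per-device batch size of 8.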
The issue is that when I run trainer.train() I receive the following error message:
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
FutureWarning,
***** Running training *****
Num examples = 8
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 3
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-25-3435b262f1ae> in <module>
----> 1 trainer.train()
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1398 tr_loss_step = self.training_step(model, inputs)
1399 else:
-> 1400 tr_loss_step = self.training_step(model, inputs)
1401
1402 if (
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/trainer.py in training_step(self, model, inputs)
1982
1983 with self.autocast_smart_context_manager():
-> 1984 loss = self.compute_loss(model, inputs)
1985
1986 if self.args.n_gpu > 1:
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
2014 else:
2015 labels = None
-> 2016 outputs = model(**inputs)
2017 # Save past state if it exists
2018 # TODO: this needs to be fixed and made cleaner later.
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/models/visual_bert/modeling_visual_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, visual_embeds, visual_attention_mask, visual_token_type_ids, image_text_alignment, output_attentions, output_hidden_states, return_dict, labels)
1143 output_attentions=output_attentions,
1144 output_hidden_states=output_hidden_states,
-> 1145 return_dict=return_dict,
1146 )
1147
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/models/visual_bert/modeling_visual_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, visual_embeds, visual_attention_mask, visual_token_type_ids, image_text_alignment, output_attentions, output_hidden_states, return_dict)
821 visual_embeds=visual_embeds,
822 visual_token_type_ids=visual_token_type_ids,
--> 823 image_text_alignment=image_text_alignment,
824 )
825
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/models/visual_bert/modeling_visual_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, visual_embeds, visual_token_type_ids, image_text_alignment)
140 )
141
--> 142 visual_embeds = self.visual_projection(visual_embeds)
143 visual_token_type_embeddings = self.visual_token_type_embeddings(visual_token_type_ids)
144
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/linear.py in forward(self, input)
91
92 def forward(self, input: Tensor) -> Tensor:
---> 93 return F.linear(input, self.weight, self.bias)
94
95 def extra_repr(self) -> str:
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/functional.py in linear(input, weight, bias)
1690 ret = torch.addmm(bias, input, weight.t())
1691 else:
-> 1692 output = input.matmul(weight.t())
1693 if bias is not None:
1694 output += bias
RuntimeError: mat1 dim 1 must match mat2 dim 0
The issue seems to originate from image_text_alignment and visual_embeds.
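For reference, this is the quick check I would run to compare my feature dimension against what the model's visual projection expects (a sketch; train is the training Dataset from above):

sample = train[0]
print(sample["visual_embeds"].shape)        # shape of my detectron2 features
print(model.config.visual_embedding_dim)    # input dimension of the model's visual projection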
- Any ideas on how to resolve this error?
- Do the overall steps sound like a good approach to prepare the data, or would you try something else?
Thanks!