Run detectron2 for feature extraction in SageMaker notebook

Dear HuggingFace/SageMaker pros,

I am trying to produce visual embeddings to fine-tune a VisualBERT model.

There is an example, “Generate Embeddings for VisualBERT”, that uses detectron2 to do this (it is linked from the VisualBERT documentation).

However, this notebook is written for Colab. I can run it inside Colab, but when I download the .ipynb and try to run it in SageMaker, I cannot install detectron2…

I try to run:
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

As well as:
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2

But both fail to install detectron2…

Would anyone have an idea on how I can run the “Generate Embeddings for VisualBERT” example inside SageMaker? (Either how to install detectron2, or a way to bypass this.)

Or would you know some other way to successfully extract the visual embedding in a way that VisualBERT would accept?

Many thanks for your help! Best, Petrus

@Petrus you can take a look at Sagemaker Serverless Inference for LayoutLMv2 model - #15 by mansimov
to see how to install detectron2 properly.

Are you trying to fine-tune VisualBERT in a SageMaker Notebook, or using the HuggingFace estimator with .fit()?

@philschmid Thanks a lot for your reply!

That thread was very informative! I am now able to successfully install detectron2 inside my SageMaker notebook.

What eventually worked for me to install detectron2 inside SageMaker was the following:

  1. Use a regular 'python3' notebook.
  2. Run pip3 install torch torchvision
  3. Follow the detectron2 installation shown in this Colab example.

I ran the following:

# pin pyyaml, as in the Colab example
!pip install pyyaml==5.1

import torch
# detectron2 publishes pre-built wheels per torch/CUDA combination, so read both from the installed torch build
TORCH_VERSION = ".".join(torch.__version__.split(".")[:2])
CUDA_VERSION = torch.__version__.split("+")[-1]
print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)

# install the matching pre-built wheel from the detectron2 wheel index
!pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/$CUDA_VERSION/torch$TORCH_VERSION/index.html
  4. After that I could successfully import detectron2.
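
As a quick sanity check (assuming the wheel matched the torch/CUDA combination printed above), the install can be verified with something like:

import detectron2
print(detectron2.__version__)  # confirms the package imports and shows the installed version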

@philschmid To answer your second point on whether I want to fine-tune inside a SageMaker notebook or use an estimator object for training: either is fine - I want to try both and go with whichever gets me most quickly to a model I can experiment with. Given that, any particular advice you would give me on my VisualBERT model building journey?

The easiest approach is to experiment in the Notebook and then move to a managed Training Job when training end-to-end on the full dataset.
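
A minimal sketch of what such a managed Training Job could look like with the HuggingFace estimator (the entry point, instance type, container versions, and S3 path below are placeholders to adapt):

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# train.py would contain roughly the same Trainer code used in the notebook
huggingface_estimator = HuggingFace(
    entry_point="train.py",        # placeholder training script
    source_dir="./scripts",        # placeholder directory with the script and a requirements.txt
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.17",   # pick versions that match an available Hugging Face DLC
    pytorch_version="1.10",
    py_version="py38",
    hyperparameters={"epochs": 3},
)

# channel name and S3 prefix are placeholders for the prepared dataset
huggingface_estimator.fit({"train": "s3://your-bucket/visualbert/train"})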

@philschmid thank you, I took your advice and ran my experiments directly in the Notebook.

However, I am having some trouble putting together my multimodal text and image inputs so that they are accepted by the model. For now I am running a small-scale test on ten data points to get the input preparation right.

This is what I have done:

  1. Generated tokens from my text: input_ids, token_type_ids, attention_mask.
    I used the “bert-base-cased” tokenizer for this.
  2. Generated visual embeddings from my images: visual_embeds, visual_token_type_ids, visual_attention_mask.
    I followed the “Generate Embeddings for VisualBERT” Colab example and adapted it to my data to generate the visual embeddings (using the detectron2 library).
  3. I put these six tensors and the labels (also a tensor) into a dictionary, transformed it into a Dataset, and split it into train and test Datasets (a sketch of this step follows the training code below).
    Here is what my training dataset looks like:
Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask', 'visual_embeds', 'visual_token_type_ids', 'visual_attention_mask'],
    num_rows: 8
})
  4. I then run training as follows:
from transformers import BertTokenizer, VisualBertForMultipleChoice
from transformers import TrainingArguments
from transformers import Trainer
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# VisualBERT checkpoint pre-trained on the VCR task
model = VisualBertForMultipleChoice.from_pretrained("uclanlp/visualbert-vcr")

# default training arguments; only the output directory is set
training_args = TrainingArguments(
    output_dir="output_dir/",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=test,
)

trainer.train()
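
For reference, here is a minimal sketch of how steps 1 and 3 can be put together with the datasets library (texts and labels are placeholders for my ten examples, the visual tensors come from step 2, and Dataset.from_dict / train_test_split are one way to build and split the Dataset):

import torch
from datasets import Dataset
from transformers import BertTokenizer

# the tokenizer from step 1
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# step 1: tokenize the ten texts with a fixed length so every row has the same shape
enc = tokenizer(texts, padding="max_length", truncation=True, max_length=128, return_tensors="pt")

# step 3: combine the text tensors, the detectron2 visual tensors and the labels into one dictionary
data = {
    "labels": labels,
    "input_ids": enc["input_ids"],
    "token_type_ids": enc["token_type_ids"],
    "attention_mask": enc["attention_mask"],
    "visual_embeds": visual_embeds,
    "visual_token_type_ids": visual_token_type_ids,
    "visual_attention_mask": visual_attention_mask,
}

# Dataset.from_dict expects lists/arrays, so convert the tensors first,
# then ask the Dataset to hand tensors back to the Trainer
dataset = Dataset.from_dict({k: v.tolist() for k, v in data.items()})
dataset.set_format("torch")

split = dataset.train_test_split(test_size=0.2)
train, test = split["train"], split["test"]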

The issue is that when I run trainer.train() I receive the following error message:

/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 8
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-25-3435b262f1ae> in <module>
----> 1 trainer.train()

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1398                         tr_loss_step = self.training_step(model, inputs)
   1399                 else:
-> 1400                     tr_loss_step = self.training_step(model, inputs)
   1401 
   1402                 if (

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/trainer.py in training_step(self, model, inputs)
   1982 
   1983         with self.autocast_smart_context_manager():
-> 1984             loss = self.compute_loss(model, inputs)
   1985 
   1986         if self.args.n_gpu > 1:

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   2014         else:
   2015             labels = None
-> 2016         outputs = model(**inputs)
   2017         # Save past state if it exists
   2018         # TODO: this needs to be fixed and made cleaner later.

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/models/visual_bert/modeling_visual_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, visual_embeds, visual_attention_mask, visual_token_type_ids, image_text_alignment, output_attentions, output_hidden_states, return_dict, labels)
   1143             output_attentions=output_attentions,
   1144             output_hidden_states=output_hidden_states,
-> 1145             return_dict=return_dict,
   1146         )
   1147 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/models/visual_bert/modeling_visual_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, visual_embeds, visual_attention_mask, visual_token_type_ids, image_text_alignment, output_attentions, output_hidden_states, return_dict)
    821             visual_embeds=visual_embeds,
    822             visual_token_type_ids=visual_token_type_ids,
--> 823             image_text_alignment=image_text_alignment,
    824         )
    825 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/models/visual_bert/modeling_visual_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, visual_embeds, visual_token_type_ids, image_text_alignment)
    140                 )
    141 
--> 142             visual_embeds = self.visual_projection(visual_embeds)
    143             visual_token_type_embeddings = self.visual_token_type_embeddings(visual_token_type_ids)
    144 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/linear.py in forward(self, input)
     91 
     92     def forward(self, input: Tensor) -> Tensor:
---> 93         return F.linear(input, self.weight, self.bias)
     94 
     95     def extra_repr(self) -> str:

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/functional.py in linear(input, weight, bias)
   1690         ret = torch.addmm(bias, input, weight.t())
   1691     else:
-> 1692         output = input.matmul(weight.t())
   1693         if bias is not None:
   1694             output += bias

RuntimeError: mat1 dim 1 must match mat2 dim 0

The issue seems to originate from image_text_alignment and visual_embeds.

  • Any ideas on how to resolve this error?
  • Do the overall steps sound like a good approach to prepare the data, or would you try something else?

Thanks!

@Petrus would you mind opening a new thread in Beginners - Hugging Face Forums? This topic is not really SageMaker-specific and might be quite interesting for other community members.

Also, I am not sure if you have seen this, but we provide nice documentation for all models, including VisualBertForMultipleChoice, with an example of how to do a forward pass.
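
For reference, a rough sketch of that forward pass (paraphrased; get_visual_embeddings and image stand in for your detectron2 pipeline):

import torch
from transformers import BertTokenizer, VisualBertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForMultipleChoice.from_pretrained("uclanlp/visualbert-vcr")

prompt = "What is shown in the image?"
choices = ["A cat on a sofa.", "A dog in a park."]

# text side: one (prompt, choice) pair per choice, then add a batch dimension
encoding = tokenizer([[prompt, c] for c in choices], return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}  # (1, num_choices, seq_len)

# visual side: get_visual_embeddings is a placeholder for the detectron2 feature extractor
# and should return a (num_boxes, visual_embedding_dim) tensor for one image
feats = get_visual_embeddings(image)
visual_embeds = feats.unsqueeze(0).unsqueeze(0).expand(1, len(choices), *feats.shape)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update(
    {
        "visual_embeds": visual_embeds,
        "visual_token_type_ids": visual_token_type_ids,
        "visual_attention_mask": visual_attention_mask,
        "labels": torch.tensor(0).unsqueeze(0),  # index of the correct choice
    }
)

outputs = model(**inputs)
print(outputs.loss, outputs.logits.shape)

It is also worth checking that visual_embeds.shape[-1] matches model.config.visual_embedding_dim, since a mismatch there shows up as a matmul shape error inside the visual projection layer.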

@philschmid I am happy to open a new thread there.

But just to respond to your second point: yes, I have seen that example and tried to follow it. Am I right that their way of assembling the inputs as a dictionary only works for making predictions, whereas for training you need to use a PyTorch Dataset?

Thank you Philipp!