Model() output issue during migration from pytorch_pretrained_bert to transformers

I have the following code snippet that lets me extract features with bert-base-uncased, where the model is loaded via from pytorch_pretrained_bert.modeling import BertModel:

    def extract_bert_features(self, conll_dataset):
        sentences = [[e.form for e in sentence] for sentence in conll_dataset]
        # data loading
        features = []
        for sentence in sentences:
            bert_tokens, map_to_original_tokens = self.convert_to_bert_tokenization(sentence)
            feature = self.from_bert_tokens_to_features(bert_tokens, map_to_original_tokens)
            features.append(feature)
        
        all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
        # mask with 0's for placeholders
        all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
        # map from BERT token positions back to the original token positions
        all_token_maps = torch.tensor([f.map_to_original_tokens for f in features], dtype=torch.long)
        # indices 0...n-1 that map each example back to the dataset
        all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
        
        # wrap the tensors needed for evaluation in a dataset
        eval_data = TensorDataset(all_input_ids, all_input_mask, all_token_maps, all_example_index)
        # create a sampler which will be used to create the batches
        eval_sampler = SequentialSampler(eval_data)
        eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=self.batch_size)

        for input_ids, input_mask, token_maps, example_indices in eval_dataloader:
            input_ids = input_ids.to(self.device)
            input_mask = input_mask.to(self.device)
            ### RUN MODEL:  run model to get all 12 layers of bert ###
            all_encoder_layers, _ = self.model(input_ids, token_type_ids=None, attention_mask=input_mask)
            averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes)

            for i, idx in enumerate(example_indices):
                for j, coll_entry in enumerate(conll_dataset[idx]):
                    if token_maps[i,j] < 511:
                        coll_entry.bert = averaged_output[i,token_maps[i,j]].clone().detach().cpu()
                    else:
                        coll_entry.bert = averaged_output[i,token_maps[i,511]].clone().detach().cpu()

Using bert-base-uncased from pytorch_pretrained_bert everything works correctly, because my all_encoder_layers object is a list of 12 hidden-layer tensors, so I can pick the layers at the positions in self.layer_indexes and take their average.
In particular the dimensions are:

print("All encoder layers: ", all_encoder_layers)                  # list type
print("Number of layers:", len(all_encoder_layers))                # 12
print("Number of batches:", len(all_encoder_layers[0]))            # 1
print("Number of tokens:", len(all_encoder_layers[0][0]))          # 512
print("Number of hidden units:", len(all_encoder_layers[0][0][0])) # 768
print("Idx: ", self.layer_indexes)                                 # [-1, -2, -3, -4]

print("Averaged_output len: ", len(averaged_output))  # 1
print("Averaged_output dim: ", averaged_output.shape) # torch.Size([1, 512, 768])

However, when I migrate my code to the transformers library, importing AutoTokenizer and AutoModel, the resulting all_encoder_layers object is no longer the full list of 12 hidden layers, but a single torch tensor of shape torch.Size([1, 512, 768]). In particular the dimensions now are:

print("All encoder layers: ", all_encoder_layers)                # tensor type
print("Number of layers:", len(all_encoder_layers))              # 1
print("Number of batches:", len(all_encoder_layers[0]))          # 512
print("Number of tokens:", len(all_encoder_layers[0][0]))        # 768
print("Size of all encoder_layers: ", all_encoder_layers.size()) # torch.Size([1, 512, 768])
print("Idx: ", self.layer_indexes)                               # [-1, -2, -3, -4] 

which results in the following error when I attempt to create the averaged_output:

  File "/.../bert_features.py", line 103, in extract_bert_features
    averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes)
  File "/.../bert_features.py", line 103, in <listcomp>
    averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes)
IndexError: index -2 is out of bounds for dimension 0 with size 1

The migration documentation states that I should take the first element of all_encoder_layers as a replacement, but is doing that equivalent to what I was doing previously, i.e. averaging the chosen layers?

If the answer is yes, then I'm fine. Otherwise, do you have any ideas on how I could replicate the line averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes), which works for bert-base-uncased from pytorch_pretrained_bert, so that it also works with transformers / pytorch_transformers?
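
One idea I had, which I have not verified, is to explicitly ask the new library for all hidden states. Assuming AutoModel accepts output_hidden_states=True and returns the hidden states as the last element of the output tuple (this is my reading of the docs, so please correct me if I'm wrong), something like this might replicate the old behaviour:

    import torch
    from transformers import AutoModel

    # assumption: this config flag makes the forward pass also return
    # the hidden states of every layer as the last element of the tuple
    model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
    model.eval()

    layer_indexes = [-1, -2, -3, -4]
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=input_mask)
    # if my understanding is correct, hidden_states has 13 entries
    # (embedding output + 12 layers), each of shape [batch, 512, 768]
    hidden_states = outputs[-1]
    averaged_output = torch.stack([hidden_states[idx] for idx in layer_indexes]).mean(0) / len(layer_indexes)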

Thanks a lot to everyone!

P.S. The question is also on StackOverflow under the same title.

Edit: the most immediately obvious thing is that the shape of averaged_output is the same as that of all_encoder_layers. In fact, if I use the latter directly, the code runs and still gives excellent results. The problem is that by doing so I am no longer considering only the last 4 layers, but all 12 layers condensed into a single tensor.
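
To check my understanding, I was planning to compare the two libraries directly (just a rough sanity check I have not run yet; old_model and new_model stand for the models loaded with pytorch_pretrained_bert and transformers respectively):

    # does the single tensor returned by transformers match the last
    # element of the old 12-layer list?
    old_layers, _ = old_model(input_ids, token_type_ids=None, attention_mask=input_mask)  # pytorch_pretrained_bert
    new_output = new_model(input_ids, attention_mask=input_mask)[0]                       # transformers
    print(torch.allclose(old_layers[-1], new_output, atol=1e-5))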

Does anyone know whether this change matters for BERT (and especially UmBERTo) feature extraction, or whether I can postpone dealing with it?