I have the following code snippet that allows me to extract features using `bert-base-uncased`, imported with `from pytorch_pretrained_bert.modeling import BertModel`:
```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

def extract_bert_features(self, conll_dataset):
    sentences = [[e.form for e in sentence] for sentence in conll_dataset]
    # data loading
    features = []
    for sentence in sentences:
        bert_tokens, map_to_original_tokens = self.convert_to_bert_tokenization(sentence)
        feature = self.from_bert_tokens_to_features(bert_tokens, map_to_original_tokens)
        features.append(feature)
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    # mask with 1's for real tokens and 0's for placeholders
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
    # maps from BERT word pieces back to the original tokens
    all_token_maps = torch.tensor([f.map_to_original_tokens for f in features], dtype=torch.long)
    # tensor with 0...n-1 where n is the number of examples: indexes that map back to the dataset
    all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
    # create a dataset with the resources needed
    eval_data = TensorDataset(all_input_ids, all_input_mask, all_token_maps, all_example_index)
    # create a sampler which will be used to create the batches
    eval_sampler = SequentialSampler(eval_data)
    eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=self.batch_size)
    for input_ids, input_mask, token_maps, example_indices in eval_dataloader:
        input_ids = input_ids.to(self.device)
        input_mask = input_mask.to(self.device)
        ### RUN MODEL: run model to get all 12 layers of bert ###
        all_encoder_layers, _ = self.model(input_ids, token_type_ids=None, attention_mask=input_mask)
        averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes)
        for i, idx in enumerate(example_indices):
            for j, coll_entry in enumerate(conll_dataset[idx]):
                if token_maps[i, j] < 511:
                    coll_entry.bert = averaged_output[i, token_maps[i, j]].clone().detach().cpu()
                else:
                    coll_entry.bert = averaged_output[i, token_maps[i, 511]].clone().detach().cpu()
```
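(For context, `self.from_bert_tokens_to_features` returns a simple feature container; this is only a rough sketch of its fields, not my exact code:)

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BertFeature:
    input_ids: List[int]               # word-piece ids, padded to max_seq_len (512)
    input_mask: List[int]              # 1 for real word pieces, 0 for padding
    map_to_original_tokens: List[int]  # original token i -> index of its first word piece
```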
Using `bert-base-uncased` from `pytorch_pretrained_bert`, everything works correctly because my `all_encoder_layers` object is a list of 12 hidden-layer tensors, which lets me pick the layer at each `idx` position and take the average. In particular, the dimensions are:
print("All encoder layers: ", all_encoder_layers) # list type
print("Number of layers:", len(all_encoder_layers)) # 12
print("Number of batches:", len(all_encoder_layers[0])) # 1
print("Number of tokens:", len(all_encoder_layers[0][0])) # 512
print("Number of hidden units:", len(all_encoder_layers[0][0][0])) # 768
print("Idx: ", self.layer_indexes) # [-1, -2, -3, -4]
print("Averaged_output len: ", len(averaged_output)) # 1
print("Averaged_output dim: ", averaged_output.shape) # torch.Size([1, 512, 768])
However, when I migrate my code to the `transformers` library, importing `AutoTokenizer` and `AutoModel`, the resulting `all_encoder_layers` object is no longer the full list of 12 hidden layers but a single torch tensor of shape `torch.Size([1, 512, 768])`. In particular, the dimensions now are:
print("All encoder layers: ", all_encoder_layers) # tensor type
print("Number of layers:", len(all_encoder_layers)) # 1
print("Number of batches:", len(all_encoder_layers[0])) # 512
print("Number of tokens:", len(all_encoder_layers[0][0])) # 768
print("Size of all encoder_layers: ", all_encoder_layers.size()) # torch.Size([1, 512, 768])
print("Idx: ", self.layer_indexes) # [-1, -2, -3, -4]
which results in the following error when I attempt to create `averaged_output`:

```
File "/.../bert_features.py", line 103, in extract_bert_features
    averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes)
File "/.../bert_features.py", line 103, in <listcomp>
    averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes)
IndexError: index -2 is out of bounds for dimension 0 with size 1
```
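If I understand correctly, the indexing now hits the batch dimension of a single tensor instead of a Python list of layers, which a small sketch reproduces:

```python
import torch

# with transformers, the first element of the model output is just the last
# hidden state, of shape (batch_size, seq_len, hidden_size) = (1, 512, 768)
all_encoder_layers = torch.randn(1, 512, 768)

all_encoder_layers[-1]  # fine: the single example in the batch, shape (512, 768)
all_encoder_layers[-2]  # IndexError: index -2 is out of bounds for dimension 0 with size 1
```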
The migration documentation states that I should take the first element of the `all_encoder_layers` object as a replacement, but is doing this equivalent to what I was doing previously to create the average? If the answer is yes, then I'm fine; otherwise, does anyone have ideas on how I could replicate the line `averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes)`, which works for `bert-base-uncased` from `pytorch_pretrained_bert`, so that it also works with `pytorch_transformers`?
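My best guess so far is to ask the model for all of its hidden states and index into those instead. Below is a minimal sketch of the idea; I am assuming here that passing `output_hidden_states=True` to `from_pretrained` makes the model also return the full set of hidden states (the embedding output plus one tensor per layer, i.e. 13 entries for `bert-base-uncased`) as the last element of its output:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# assumption: output_hidden_states=True is forwarded to the config and makes
# the forward pass also return all hidden states
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.eval()

input_ids = tokenizer.encode("An example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids)

# assumed layout: the hidden states are the last element of the output tuple,
# a tuple of 13 tensors of shape (batch_size, seq_len, 768) each
hidden_states = outputs[-1]
layer_indexes = [-1, -2, -3, -4]

# the original line, applied to the recovered per-layer tensors
averaged_output = torch.stack([hidden_states[idx] for idx in layer_indexes]).mean(0) / len(layer_indexes)
print(averaged_output.shape)  # torch.Size([1, seq_len, 768])
```

If that assumption holds, `hidden_states[-1]` should be exactly the single tensor I am currently getting back, so the old last-4-layers average could be reconstructed on top of it.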
Thanks a lot to everyone!
P.S. The question is also on StackOverflow with the same title.
Edit: The most immediately obvious thing is that the shape of `averaged_output` is the same as that of `all_encoder_layers`. In fact, substituting the latter makes the code run, and excellent results are still obtained. The problem is that, by doing so, we do not consider only the last 4 layers, but all of the 12 layers condensed into a single tensor.
Does anyone know whether this change is relevant for BERT's (especially UmBERTo's) feature-extraction purposes, or can I postpone it?