Hi,
This could be a very naive question, but I’m not able to understand what features are extracted by the “feature-extraction” pipeline. Here is what I have tried so far:
```python
import torch
from transformers import BertTokenizer, pipeline

text = 'I will learn the embeddings for this sentence now'
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
feature_extractor = pipeline('feature-extraction', model='bert-base-multilingual-uncased', tokenizer=tokenizer)
try:
    features = torch.tensor(feature_extractor(text))
    print(features)
except RuntimeError:
    print('Error')
```
which gives me the following output:
```
tensor([[[ 0.0069,  0.0085,  0.0350,  ..., -0.0127,  0.0450, -0.0289],
         [ 0.1185,  0.3802, -0.0386,  ..., -0.2473,  0.4393, -0.5417],
         [-0.1408, -0.2094, -0.1027,  ..., -0.0744,  0.3208, -1.0260],
         ...,
         [-0.0517,  0.0047, -0.1229,  ..., -0.0555,  0.4420, -0.2788],
         [-0.1698,  0.2366, -0.3831,  ..., -0.0218,  0.3211, -0.3036],
         [-0.4897,  0.3905, -0.1925,  ..., -0.0605,  0.2510, -0.8872]]])
```
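For reference, the tensor has one vector per input token, so I expect its shape to be `(batch, sequence_length, hidden_size)`, where `hidden_size` is 768 for `bert-base-multilingual-uncased` and the sequence length is whatever the sentence tokenizes to plus the special tokens:

```python
# Sanity check on the shape: (batch, sequence_length, hidden_size).
print(features.shape)
```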
I then tried to extract the hidden states directly from the model with the following code:
```python
import torch
from transformers import BertModel, BertTokenizer

class BertFeatureExtractor:
    def __init__(self, model_name):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name)
        self.model.eval()

    def extract(self, text):
        # Initialise both so the return statement is safe even if
        # tokenization or the forward pass raises.
        encoded_input, output = None, None
        try:
            encoded_input = self.tokenizer(text, return_tensors='pt')
            output = self.model(**encoded_input, output_hidden_states=True)
        except RuntimeError:
            print(f'Model cannot learn embeddings for {text}')
        return encoded_input, output
```
I then get the embeddings as:
```python
feat_extractor = BertFeatureExtractor('bert-base-multilingual-uncased')
with torch.no_grad():
    encoded_input, output = feat_extractor.extract(text)
```
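For completeness, this is roughly how I’m comparing the two, assuming the shapes line up (the `atol` here is just a tolerance I picked, nothing prescribed):

```python
# Compare the pipeline features against the model's hidden states.
pipeline_feats = torch.tensor(feature_extractor(text))

print(torch.allclose(pipeline_feats, output['last_hidden_state'], atol=1e-5))
for i, layer in enumerate(output['hidden_states']):
    print(i, torch.allclose(pipeline_feats, layer, atol=1e-5))
```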
Neither the tensors in `output['hidden_states']` nor `output['last_hidden_state']` match the output of the feature-extraction pipeline. Is that expected? Are the features computed by taking some combination of the layers? If so, how? Or is the feature extraction done in a different way altogether?