Hi,
I would like to compute sentence embeddings for Italian, using the BERTino Italian DistilBERT model.
I see two options:
- use Sentence Transformers (GitHub - UKPLab/sentence-transformers: Sentence Embeddings with BERT & XLNet), although it is not clear to me whether it supports DistilBERT
- use the feature-extraction pipeline in the following simple way:
from transformers import pipeline
nlp_features = pipeline('feature-extraction', model='indigo-ai/BERTino', tokenizer='indigo-ai/BERTino')
In the second case, it is also not clear to me what the proper way is to pool the word embeddings into a sentence embedding.
A simple, but maybe naive, way is to average the token embeddings (a mean over the token axis), such as:
import numpy as np
words_embeddings = nlp_features('Il cielo è pieno di stelle che luccicano')
# words_embeddings[0] has one row per token, so axis=0 averages over tokens
sentence_embeddings = np.array(words_embeddings[0]).mean(axis=0)
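The way I compare two such sentence vectors can be sketched as follows. This is a minimal plain-Python version so that it runs without any dependencies. The helper name cosine_similarity and the three short vectors are made up for illustration and stand in for real sentence embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors standing in for real sentence embeddings
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]    # same direction as emb_a
emb_c = [-3.0, 0.0, 1.0]   # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)  # close to 1: same direction
sim_ac = cosine_similarity(emb_a, emb_c)  # close to 0: orthogonal
```

With real embeddings the same function would be applied to the sentence_embeddings vectors of two different sentences.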
Applying cosine similarity afterwards and testing the embeddings with several sentences, the results seem to make sense. However, I have also noticed that the page of the sentence-transformers model includes a code snippet that computes mean_pooling and takes the “attention mask” into account:
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask
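My current reading of that snippet, which may be incomplete, is that the mask excludes padding positions from the average. Below is a toy sketch of that idea in plain Python. The function names plain_mean and masked_mean and all the numbers are made up for illustration, and the padding vectors are arbitrary values, not real model output:

```python
def plain_mean(token_embeddings):
    """Average over all positions, padding included."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]

def masked_mean(token_embeddings, attention_mask):
    """Average only over positions whose mask value is 1."""
    dim = len(token_embeddings[0])
    # Same guard against division by zero as torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [
        sum(tok[d] * m for tok, m in zip(token_embeddings, attention_mask)) / count
        for d in range(dim)
    ]

# Toy sequence: 2 real tokens followed by 2 padding positions
tokens = [[1.0, 3.0], [3.0, 5.0], [10.0, 10.0], [10.0, 10.0]]
mask = [1, 1, 0, 0]

with_padding = plain_mean(tokens)             # [6.0, 7.0], skewed by the padding rows
without_padding = masked_mean(tokens, mask)   # [2.0, 4.0], real tokens only
```

In this toy case the two averages differ only because of the padded positions. When the mask is all ones, both functions return the same vector.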
Should I also take the “attention mask” into account? If so:
- why is it relevant?
- how can I get it from the feature-extraction pipeline?