How to properly compute sentence embeddings using a non-English, pretrained DistilBERT model?


I would like to compute sentence embeddings for Italian, using the BERTino Italian DistilBERT model.

I see two options:

from transformers import pipeline
nlp_features = pipeline('feature-extraction', model='indigo-ai/BERTino', tokenizer='indigo-ai/BERTino')

In the second case, it is also not really clear to me what’s the proper way to pool the word embeddings in order to obtain a sentence embedding.

A simple - but maybe naive - way is to average the token vectors over the token axis, such as:

import numpy as np
# the pipeline returns nested lists shaped [1][num_tokens][hidden_size]
words_embeddings = nlp_features('Il cielo è pieno di stelle che luccicano')
sentence_embeddings = np.array(words_embeddings[0]).mean(axis=0)
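
To make the shapes concrete, here is a minimal sketch of that averaging followed by a cosine-similarity comparison. It runs on hand-made nested lists rather than real BERTino output, and the names mean_pool, cosine_similarity, fake_a and fake_b are made up for illustration:

```python
import numpy as np

def mean_pool(pipeline_output):
    # Assumes nested lists shaped [1][num_tokens][hidden_size],
    # i.e. one sentence per call; averages over the token axis.
    return np.array(pipeline_output[0]).mean(axis=0)

def cosine_similarity(a, b):
    # Cosine of the angle between two 1-D vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for pipeline output (hidden_size = 2), NOT real embeddings.
fake_a = [[[1.0, 0.0], [3.0, 0.0]]]              # 2 tokens
fake_b = [[[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]]  # 3 tokens

emb_a = mean_pool(fake_a)  # one vector of length hidden_size
emb_b = mean_pool(fake_b)

same = cosine_similarity(emb_a, emb_a)       # identical direction
different = cosine_similarity(emb_a, emb_b)  # orthogonal toy vectors
```

Whatever the number of tokens, each sentence ends up as a single vector of length hidden_size, which is what makes the cosine comparison possible.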

After applying cosine similarity to these embeddings and testing several sentences, the results seem to make sense. However, I have also noticed that the page of the sentence-transformers model includes a code snippet that computes mean_pooling and takes the “attention mask” into account:

import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    # Broadcast the mask from (batch, seq_len) to (batch, seq_len, hidden_size)
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)  # sum over unmasked tokens only
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)  # token count, clamped to avoid division by zero
    return sum_embeddings / sum_mask
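
As far as I can tell, the mask only changes the result when a sequence contains padding, which typically happens when sentences of different lengths are batched together. The sketch below re-expresses the same computation in NumPy on a toy batch; the array values and the masked_mean name are made up for illustration, not taken from any model:

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask[..., None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # padded positions contribute 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # number of real tokens per sentence
    return summed / counts

# Toy batch of two "sentences" padded to length 3 (hidden size 2).
token_embeddings = np.array([
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],  # third position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],  # no padding
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

masked = masked_mean(token_embeddings, attention_mask)
naive = token_embeddings.mean(axis=1)  # plain mean, padding included
```

For the unpadded second row the two results agree; for the first row the plain mean is pulled toward the padding vector, while the masked mean ignores it.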

Should I also take the “attention mask” into account? If so:

  • why is it relevant?
  • how can I get it from the feature-extraction pipeline?