How to properly compute sentence embeddings using a non-English, pretrained DistilBERT model?

piercarlos · April 25, 2021, 2:57pm

Hi,

I would like to compute sentence embeddings for Italian, using the BERTino Italian DistilBERT model.

I see two options:

1. Use Sentence Transformers (GitHub - UKPLab/sentence-transformers: Sentence Embeddings with BERT & XLNet), although it’s not clear to me whether it supports DistilBERT.
2. Use the feature-extraction pipeline in the following simple way:

from transformers import pipeline
nlp_features = pipeline('feature-extraction', model='indigo-ai/BERTino', tokenizer='indigo-ai/BERTino')

In the second case, it is also not clear to me what the proper way is to pool the word (token) embeddings into a single sentence embedding.

A simple, but maybe naive, way is to take the mean over the token axis, such as:

import numpy as np
words_embeddings = nlp_features('Il cielo è pieno di stelle che luccicano')
# words_embeddings[0] has one row per token; average over the token axis
sentence_embeddings = np.array(words_embeddings[0]).mean(axis=0)
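
To make the shapes explicit, here is a minimal sketch of that averaging followed by a cosine similarity check. The random arrays only stand in for the pipeline output, and the hidden size of 768 as well as the helper names mean_pool and cosine_similarity are my own assumptions for illustration, not part of the transformers API.

```python
import numpy as np

# Stand-ins for np.array(nlp_features(sentence)[0]): one row per token,
# one column per hidden dimension. Random values and the hidden size of 768
# are assumptions used only to show the shapes.
rng = np.random.default_rng(0)
tokens_a = rng.normal(size=(9, 768))   # a sentence with 9 tokens
tokens_b = rng.normal(size=(12, 768))  # a sentence with 12 tokens

def mean_pool(token_embeddings):
    # Average over the token axis (axis 0), keeping the hidden dimension.
    return token_embeddings.mean(axis=0)

def cosine_similarity(u, v):
    # Dot product of the two vectors divided by the product of their norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sent_a = mean_pool(tokens_a)
sent_b = mean_pool(tokens_b)
print(sent_a.shape, sent_b.shape)
print(cosine_similarity(sent_a, sent_a), cosine_similarity(sent_a, sent_b))
```

Both sentence vectors end up with the same length regardless of how many tokens each sentence had, which is what makes them comparable.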

After applying cosine similarity and testing the embeddings on several sentences, the results seem to make sense. However, I have also noticed that the page of the sentence-transformers model includes a code snippet that computes mean_pooling and takes the “attention mask” into account:


import torch

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    # Expand the mask from (batch, tokens) to (batch, tokens, hidden)
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    # Clamp to avoid division by zero when a row has no unmasked tokens
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

Should I also take the “attention mask” into account? If so:

1. Why is it relevant?
2. How can I get it from the feature-extraction pipeline?
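
To illustrate what the mask seems to change, here is a small NumPy sketch of the same masked average on a toy padded batch. The numbers and the name masked_mean_pool are made up for illustration; with a real model the mask would presumably be the attention_mask returned when calling the tokenizer directly, which is an assumption on my side.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (batch, tokens, hidden); attention_mask: (batch, tokens)
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Number of real tokens per sentence, clipped to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch: 2 sentences padded to 4 tokens, hidden size 3 (made-up numbers).
emb = np.array([
    [[1., 1., 1.], [3., 3., 3.], [9., 9., 9.], [9., 9., 9.]],  # last two rows are padding
    [[2., 0., 0.], [0., 2., 0.], [0., 0., 2.], [2., 2., 2.]],  # no padding
])
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 1]])

masked = masked_mean_pool(emb, mask)  # averages only the real tokens
naive = emb.mean(axis=1)              # averages the padding rows as well
print(masked[0], naive[0])
print(masked[1], naive[1])
```

In this toy batch the first sentence has two real tokens and two padding rows, so the masked result differs from the plain mean, while for the unpadded second sentence both agree. If that reasoning is right, a single sentence passed without padding should give the same result either way.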
