How to use embeddings to compute similarity?

Hi, I would like to compute sentence similarity between an input text and an output text using cosine similarity and the embeddings I can get from the Feature Extraction task.

However, I noticed that it returns matrices of different dimensions, so I cannot perform the matrix calculation. For example, with facebook/bart-base · Hugging Face you’ll get a different matrix size depending on the input text.

Is there any way to get just a single vector?

Am I thinking about this the right way?

With transformers, the feature-extraction pipeline will retrieve one embedding per token.

If you want a single embedding for the full sentence, you probably want to use the sentence-transformers library. There are hundreds of sentence-transformers models on the Hub you can use (Models - Hugging Face). You could also use a transformers model and do the pooling yourself, but I would suggest just using sentence-transformers :slight_smile:
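If you do want to do the pooling yourself with transformers, here is a rough sketch of mean pooling (the checkpoint below is just an example; any encoder model works the same way):

import torch
from transformers import AutoTokenizer, AutoModel

# Example checkpoint; swap in whichever model you are using.
model_name = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["this is a sentence", "this is another sentence"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # One embedding per token: shape (2, seq_len, 768)
    token_embeddings = model(**encoded).last_hidden_state

# Mean-pool over the token dimension, ignoring padding via the attention mask.
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 768])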

For inference, you can use something like this:

import requests

API_URL = "https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-mpnet-base-v2"
# Replace API_TOKEN with your Hugging Face API token.
headers = {"Authorization": "Bearer API_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": ["this is a sentence", "this is another sentence"]
})
# Output is a list of 2 embeddings, each of 768 values.
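From there, cosine similarity is straightforward. A minimal sketch, assuming the query above succeeded and output is the list of two 768-value embeddings:

import numpy as np

# Cosine similarity between the two sentence embeddings returned above.
a, b = np.array(output[0]), np.array(output[1])
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)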

Thanks for your response @osanseviero, but that does not work.

curl \
  -H 'Authorization: Bearer <API KEY>' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' -X POST \
  --data '{"inputs": ["this is my test"]}' \
  https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-mpnet-base-v2
curl: (18) transfer closed with 7580 bytes remaining to read

This seems to be more of an issue with how the request is made in curl. I will take a look, but I'm also pinging @Narsil on this.

I have been using sentence-transformers to calculate document embeddings and then used them as input for document clustering.

I read somewhere that it is best to use a model that was trained on a downstream task focused on text similarity, but a standard BERT model worked well enough in my use case.

Moreover, you can play around with different ways of extracting the embeddings, like mean-pooling, the [CLS] token, etc. I never had a metric to compare whether one is better than the other - I just did some cluster analysis and checked visually whether the clusters made sense.
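As a rough illustration of those two pooling options with a plain transformers model (using the same gbert-base checkpoint as in the example below; for a single sentence there is no padding to mask out):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")
model = AutoModel.from_pretrained("deepset/gbert-base")

encoded = tokenizer("Ein kurzer Beispielsatz.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state  # (1, seq_len, 768)

cls_embedding = hidden[:, 0]         # embedding of the [CLS] token
mean_embedding = hidden.mean(dim=1)  # mean over all tokens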

Here is a small working example (using a German dataset with a German BERT model):

# pip install -U -q sentence-transformers
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load 10% of the German news dataset.
dataset = load_dataset("gnad10", split='train[:10%]')
print(dataset.num_rows)  # 924

# Encode every document into a single 768-dimensional embedding.
model = SentenceTransformer('deepset/gbert-base')
embeddings = model.encode(dataset["text"])
print(embeddings.shape)  # (924, 768)

# Project the embeddings to 2D for visual inspection.
result = TSNE().fit_transform(embeddings)
print(result.shape)  # (924, 2)

plt.scatter(result[:, 0], result[:, 1])
plt.show()

[Image: t-SNE scatter plot of the document embeddings ("tsne-clusters")]
