Hi, I would like to compute sentence similarity between an input text and an output text using cosine similarity and the embeddings I can get from the Feature Extraction task.
However, I noticed that it returns matrices of different dimensions depending on the input, so I cannot perform the matrix calculation. For example, with facebook/bart-base · Hugging Face you’ll get a different matrix size depending on the input text.
Is there any way to get just a single vector?
Am I thinking about this the right way?
With transformers, the feature-extraction pipeline will return one embedding per token.
If you want a single embedding for the full sentence, you probably want to use the sentence-transformers library. There are hundreds of sentence-transformers models on the Hub you can choose from (Models - Hugging Face). You could also use a plain transformers model and do the pooling yourself, but I would suggest just using sentence-transformers.
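In case you do go the transformers-plus-pooling route, the usual approach is mean pooling over the token embeddings while masking out padding positions. Here is a minimal numpy sketch of just the pooling step (the `mean_pool` helper and the toy tensors are illustrative, not part of any library):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the sequence, ignoring padding positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = mask.sum(axis=1)  # number of real (non-padding) tokens per sentence
    return summed / counts

# Toy example: 1 sentence, 4 token positions (last is padding), hidden size 3
tokens = np.array([[[1.0, 2.0, 3.0],
                    [3.0, 2.0, 1.0],
                    [2.0, 2.0, 2.0],
                    [9.0, 9.0, 9.0]]])  # padding row, should be ignored
mask = np.array([[1, 1, 1, 0]])
print(mean_pool(tokens, mask))  # [[2. 2. 2.]]
```

With a real model you would feed the pipeline's per-token output and the tokenizer's attention mask into the same function.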
For inference, you can use something like this
import requests
API_URL = "https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-mpnet-base-v2"
headers = {"Authorization": "Bearer API_TOKEN"}
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": ["this is a sentence", "this is another sentence"]
})
# Output is a list of 2 embeddings, each of 768 values.
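Since the original question was about cosine similarity, once you have the two 768-dimensional vectors you can compare them with a few lines of numpy (the `cosine_similarity` helper here is a hand-rolled sketch, not an API function):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = identical direction, 0 = orthogonal."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With the API output above this would be:
# sim = cosine_similarity(output[0], output[1])
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```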
Thanks for your response @osanseviero, but that does not work:
curl \
-H 'Authorization: Bearer <API KEY>' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' -X POST \
--data '{"inputs": ["this is my test"]}' \
https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-mpnet-base-v2
curl: (18) transfer closed with 7580 bytes remaining to read
This seems to be more of an issue with how the request is made in curl. I will take a look, but also pinging @Narsil on this.
I have been using sentence-transformers to calculate document embeddings and then used them as input for document clustering.
I read somewhere that it is best to use a model that was trained on a downstream task focused on text similarity, but a standard BERT model worked well enough in my use case.
Moreover, you can play around with different approaches to how the embeddings are extracted, like mean pooling, the [CLS] token, … I never had a metric to compare whether one is better than the other - I just did some cluster analysis and checked visually whether the clusters made sense.
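If you do want a number instead of just eyeballing the clusters, silhouette score is one common option. A toy sketch with synthetic embeddings (in practice you would pass in the output of `model.encode(...)`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for document embeddings: two well-separated groups
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(0.0, 0.1, (50, 8)),
                        rng.normal(1.0, 0.1, (50, 8))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
score = silhouette_score(embeddings, labels)  # in [-1, 1]; higher = tighter, better-separated clusters
print(round(score, 2))
```

Running this for each pooling strategy (mean pooling vs. [CLS]) gives you a rough quantitative comparison instead of a purely visual one.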
Here is a small working example (using a German dataset with a German Bert model):
pip install -U -q sentence-transformers
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
dataset = load_dataset("gnad10", split='train[:10%]')
print(dataset.num_rows) # 924
model = SentenceTransformer('deepset/gbert-base')
embeddings = model.encode(dataset["text"])
print(embeddings.shape) # (924, 768)
result = TSNE().fit_transform(embeddings)
print(result.shape) # (924, 2)
plt.scatter(result[:, 0], result[:, 1])
plt.show()