Using the Accelerated Inference API to produce sentence embeddings

Is it possible to use the Accelerated Inference API to produce sentence embeddings as described here?

from transformers import AutoTokenizer, AutoModel
import torch

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

#Sentences we want sentence embeddings for
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

#Load AutoModel from huggingface model repository
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")

#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

#Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

Hey @vitali, I believe integration with sentence-transformers in the Inference API is currently in progress, so maybe @osanseviero can share some details (or say whether it’s currently possible).


Hi @vitali!

Your question comes at a good time. You can already do this by calling […]. This endpoint is in an experimental state at the moment, so things might not be stable.

Note that, as of now, we’re working on deeply integrating sentence-transformers with the Hub. This will be part of the v2 release of the library. Some details:

  • Allow downloading sentence-transformer models from the Hub (PR, merged).
  • Allow uploading sentence-transformer models to the Hub (PR).

We expect to have more exciting results very soon :slight_smile:


Awesome! I will try it.

Can I infer from your answer that any pipeline can be inferred like this? :slight_smile: That would be totally awesome.

The pipeline is usually inferred from the tags in the model repo. Forcing the pipeline through the API has a risk of misusing a model for a different pipeline, but there are also certain models that support multiple pipelines, so you can use that as well.


So if I add the tag “feature-extraction” to the model repo, will a call to the Inference API produce embeddings? What will happen if I add multiple tags, e.g. “feature-extraction”, “fill-mask”, “zero-shot-classification”?

Hi @vitali. We currently try to keep things simple: we usually have one task per model, and this holds for most models. But if your model does support other tasks, you can use the API URL as above.

Right now there’s no way to validate that the model will work with the task, which is why this is not shared widely: it might lead to misuse and incorrect results.


Thank you very much for your reply. I understand the issue: every transformer model can serve some tasks (embeddings, MLM, zero-shot classification), but no model can serve all tasks, hence the risk of misuse. Perhaps it would make sense to add a “capabilities.json” to the model repo that defines the list of supported tasks/pipelines based on the model architecture? I think this would clear up some confusion among users; just a thought. In any case, this approach fully covers my need to produce embeddings for downstream tasks. Thank you very much again.
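
To make the suggestion concrete, here is a rough sketch of what a client-side check against such a file could look like. Everything in it is hypothetical: the file name comes from the suggestion above, while the "tasks" key and the supports_task helper are invented for illustration and are not an existing Hub feature.

```python
import json

# Hypothetical contents of a capabilities.json file in a model repo.
# The "tasks" key is an assumption made up for this sketch.
CAPABILITIES_JSON = '{"tasks": ["feature-extraction", "fill-mask"]}'


def supports_task(capabilities_text, task):
    """Return True if the (hypothetical) capabilities document lists the task."""
    capabilities = json.loads(capabilities_text)
    return task in capabilities.get("tasks", [])


print(supports_task(CAPABILITIES_JSON, "feature-extraction"))
print(supports_task(CAPABILITIES_JSON, "zero-shot-classification"))
```

A client could run a check like this before forcing a pipeline, and refuse the call when the task is not listed.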

For community reference, the issue of defining and using model pipelines is also discussed on GitHub.


Hey Omar, is this still an experimental API? I can’t seem to find any details about it. Would appreciate you pointing me to some resources if available as I am evaluating some APIs and would like to test yours. Thanks.

Hey @abol3z.

The API is no longer experimental, but we’re still working on its documentation. You can use […] to obtain the sentence embeddings.

Let us know if you have any questions!


Hello Omar, I was trying to load embeddings via API as discussed in this thread, but I am struggling to find a model that actually supports this method.

So for example, the model supports feature extraction, but the following URL is invalid:

Can you help?
Or maybe share a sample code snippet?


Hi there! :wave: Here is a working end-to-end example for the model you suggested.

import requests

API_URL = ""
headers = {"Authorization": "Bearer TOKEN"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()
output = query({
	"inputs": "I like you. I love you",
})

Thanks sm! this helped a lot.
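
When calling the endpoint as in the example above, it can help to check the decoded JSON before treating it as an embedding. The sketch below assumes that failures come back as a JSON object with an "error" key and that a successful feature-extraction call returns a list (possibly nested) of numbers. Both are assumptions about the response shape, and as_embedding is a hypothetical helper name.

```python
def as_embedding(result):
    """Return the decoded response if it looks like embedding data, else raise."""
    # Assumption: error responses are JSON objects carrying an "error" message.
    if isinstance(result, dict):
        raise RuntimeError(f"API returned an error: {result.get('error', result)}")
    # Assumption: embeddings arrive as a non-empty (possibly nested) list.
    if not isinstance(result, list) or not result:
        raise ValueError("expected a non-empty list of numbers or nested lists")
    return result


# Made-up values standing in for a real response.
fake_ok = [[0.1, 0.2, 0.3]]
print(len(as_embedding(fake_ok)[0]))  # 3
```

With the query function from the example, this would be used as as_embedding(query({"inputs": "..."})).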


I’ve used the feature extraction pipeline successfully with sentence-transformers models.

How do I use it with a model, which requires mean_pooling to be applied to the result, such as E5?

To clarify: get a single array of feature embeddings vs the current result that comes out - 3 arrays for the word “test”.

Thank you!
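
For anyone landing here with the same question: if the feature-extraction output is one vector per token, averaging over the token axis on the client side gives a single vector. A minimal sketch, assuming the response for one input string decodes to a list of equal-length token vectors (if it comes back wrapped in an extra batch dimension, pass result[0] instead). mean_pool is a hypothetical helper name and the numbers are made up. With a single unpadded input there are no padding tokens, so no attention mask is needed here.

```python
def mean_pool(token_vectors):
    """Average a list of per-token vectors into one vector."""
    if not token_vectors:
        raise ValueError("no token vectors to pool")
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vec in token_vectors:
        if len(vec) != dim:
            raise ValueError("token vectors have inconsistent lengths")
        for i, value in enumerate(vec):
            sums[i] += value
    count = len(token_vectors)
    return [total / count for total in sums]


# Three made-up 2-dimensional token vectors, e.g. for "[CLS] test [SEP]".
fake_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(fake_tokens))  # [3.0, 4.0]
```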

Hi @mtomov, you might need a custom pipeline to post-process the result. Here is a duplicated model that applies mean_pooling on the request: radames/e5-large · Hugging Face. You can also try:

import requests

API_URL = ""
headers = {"Authorization": "Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

embeddings = query({
    "inputs": "query: how much protein should a female eat",
})