Hello,
I am trying to run the embedding model "sentence-transformers/LaBSE" with revision="836121a0533e5664b21c7aacc5d22951f2b8b25b" on Inference Endpoints.
I get a result, but the embedding values differ from the local execution, and they are not even correlated when compared with cosine similarity.
Any idea what's going on?
from abc import ABC, abstractmethod
import numpy as np
import requests
from sentence_transformers import SentenceTransformer
from sbw_fiabilis.logger import get_logger, set_level
import os
from dotenv import load_dotenv

logger = get_logger()


class EmbeddingInterface(ABC):
    """Abstract interface for embedding services."""

    @abstractmethod
    def encode(self, texts, batch_size=None, show_progress_bar=False):
        pass


class LocalEmbeddingService(EmbeddingInterface):
    """Local implementation using SentenceTransformer."""

    def __init__(self):
        WORKING_DIR = os.getenv("WORKING_DIR", os.path.join(os.path.dirname(__file__), "../../data/working_dir"))
        HF_HOME = os.path.join(WORKING_DIR, ".hf")
        os.environ["HF_HOME"] = HF_HOME
        self.model = SentenceTransformer(
            "sentence-transformers/LaBSE",
            revision="836121a0533e5664b21c7aacc5d22951f2b8b25b",
            cache_folder=HF_HOME,
        )
        logger.info("LocalEmbeddingService configured")

    def encode(self, texts, batch_size=32, show_progress_bar=False):
        return self.model.encode(texts, batch_size=batch_size, show_progress_bar=show_progress_bar)


class APIEmbeddingService(EmbeddingInterface):
    """Implementation using the Hugging Face Inference Endpoint API."""

    def __init__(self):
        self.api_url = os.getenv("EMBEDDING_API_URL")
        self.api_key = os.getenv("EMBEDDING_API_KEY")
        if not self.api_url or not self.api_key:
            raise ValueError("EMBEDDING_API_URL and EMBEDDING_API_KEY must be set")
        self.headers = {
            "Accept": "application/json",
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        logger.info("APIEmbeddingService configured")

    def _query_api(self, payload):
        try:
            response = requests.post(self.api_url, headers=self.headers, json=payload, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            logger.error(f"Error during API request: {e}")
            raise

    def encode(self, texts, batch_size=32, show_progress_bar=False):
        if not texts:
            return np.array([])
        all_embeddings = []
        total_texts = len(texts)
        logger.info(f"Encoding via API: {total_texts} texts in batches of {batch_size}")
        for i in range(0, total_texts, batch_size):
            batch = texts[i:i + batch_size]
            payload = {
                "inputs": batch,
                "parameters": {},
            }
            response = self._query_api(payload)
            # Handle the different API response formats
            if isinstance(response, list):
                batch_embeddings = response
            elif isinstance(response, dict) and "embeddings" in response:
                batch_embeddings = response["embeddings"]
            else:
                raise ValueError(f"Unexpected API response format: {type(response)}")
            all_embeddings.extend(batch_embeddings)
            logger.info(f"  Batch processed: {min(i + batch_size, total_texts)}/{total_texts}")
        return all_embeddings


def test():
    logger = get_logger()
    set_level("DEBUG")
    load_dotenv()
    texts = ["toto", "tata"]

    service = LocalEmbeddingService()
    embeddings = service.encode(texts)
    logger.info(embeddings[0][:5])
    logger.info(embeddings[1][:5])

    service = APIEmbeddingService()
    embeddings = service.encode(texts)
    logger.info(embeddings[0][:5])
    logger.info(embeddings[1][:5])


if __name__ == "__main__":
    test()
Here are the results, showing the different embeddings:
INFO - Logger level set to INFO
INFO - Logger level set to DEBUG
INFO - LocalEmbeddingService configured
INFO - [ 0.02300638 -0.07002795 -0.01850945 -0.03634194 0.0507826 ]
INFO - [-0.03088209 -0.05037568 -0.00730146 -0.0068823 0.03126564]
INFO - APIEmbeddingService configured
INFO - Encoding via API: 2 texts in batches of 32
INFO -  Batch processed: 2/2
INFO - [0.0077932924, 0.015989138, 0.010355308, 0.0026318827, 0.019499298]
INFO - [-0.007399403, -0.03194063, -0.016836794, 0.022840464, 0.001694431]
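To quantify the mismatch, here is a minimal sketch of the cosine-similarity check I ran (local_embeddings / api_embeddings are just placeholders for the two outputs above):

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# local_embeddings / api_embeddings stand in for the outputs of the two services above
for local_vec, api_vec in zip(local_embeddings, api_embeddings):
    print(cosine(local_vec, api_vec))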
If you select anything other than "Custom," I think the contents of handler.py will be ignored. In that case, the model is probably executed with the default arguments of the default pipeline, which may be why the results differ from your local code.
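If you do try the "Custom" option, a handler.py along these lines should reproduce the local sentence-transformers behaviour (a rough sketch of the EndpointHandler interface used by custom handlers; the exact response shape is up to you):

# handler.py -- rough sketch of a custom Inference Endpoints handler
from sentence_transformers import SentenceTransformer

class EndpointHandler:
    def __init__(self, path=""):
        # `path` is the local path of the repository snapshot inside the container
        self.model = SentenceTransformer(path)

    def __call__(self, data):
        inputs = data.get("inputs", [])
        if isinstance(inputs, str):
            inputs = [inputs]
        # Full sentence-transformers pipeline, same as running locally
        return {"embeddings": self.model.encode(inputs).tolist()}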
Thank you John for helping.
I am not running the endpoint that way; I am using the no-code approach, and the UI shows the right model with the right revision (see screenshots).
This means that either different libraries are being used locally and on the endpoint (in this case, SentenceTransformers locally vs. TEI on the endpoint), or the code for the template is simply buggy…
If the repository revision specification is not honored, that may also be a bug, but if that were the only issue, the cosine similarity should not be this far off.
As shown below, a fairly old version of the library is used on the endpoint. Of course, it is possible to update it manually…
Indeed, the replica's log doesn't really seem to take into account any of the parameters provided in the UI.
The log of the replica:
Args { model_id: "/rep****ory", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "r-rpelissier-sbw-fidi-labse-58w96y74-e4770-0t00y", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/repository/cache"), payload_limit: 2000000, api_key: None, json_output: true, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
Too bad if I need to debug this myself (for a paid service).
The point of a managed service is to hide the underlying complexity of provisioning, maintaining versions, and so on. I am really disappointed by what looks like a tool for POCs rather than a production-ready service.
And having a mailto:… link (which tries to open my desktop mail app instead of Gmail) as the only way to reach support is another sign that this is not very serious.
If it’s for a paid service, using Expert Support is probably the fastest and most reliable option, especially if it seems like a bug.
BTW, on my local PC:
from sentence_transformers import SentenceTransformer # sentence-transformers 4.0.1
import torch
sentences = ["This is an example sentence", "Each sentence is converted"]
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}.") # Running on cuda.
model = SentenceTransformer("sentence-transformers/LaBSE").to(device)
embeddings = model.encode(sentences)
print("main:", embeddings)
#main: [[ 0.02882478 -0.00602382 -0.05947006 ... -0.03002249 -0.029607
# 0.00067482]
# [-0.05550233 0.02546483 -0.02157256 ... 0.02932105 0.01150041
# -0.00848792]]
model = SentenceTransformer("sentence-transformers/LaBSE", revision="836121a0533e5664b21c7aacc5d22951f2b8b25b").to(device)
embeddings = model.encode(sentences)
print("836121a0533e5664b21c7aacc5d22951f2b8b25b:", embeddings)
#836121a0533e5664b21c7aacc5d22951f2b8b25b: [[ 0.02882478 -0.00602382 -0.05947006 ... -0.03002249 -0.029607
# 0.00067482]
# [-0.05550233 0.02546483 -0.02157256 ... 0.02932105 0.01150041
# -0.00848792]]
model.to("cpu")
embeddings = model.encode(sentences)
print("On CPU:", embeddings)
#On CPU: [[ 0.02882476 -0.00602385 -0.05947007 ... -0.03002251 -0.02960699
# 0.00067482]
# [-0.05550234 0.02546484 -0.02157255 ... 0.02932107 0.01150037
# -0.00848786]]
At least it is locally consistent. Thank you!
Hi rpelissier 
Sorry about the hassle here. I did a deep dive on the issue and I think I know what's going on: the model deployed in your inference endpoint uses the TEI server engine, whereas the local example uses sentence-transformers, and unfortunately there's a mismatch between the two implementations. This model is one of the few that uses a Dense module, which is supported in sentence-transformers but not in TEI.
So when the model is run with TEI (and therefore on Inference Endpoints), it's equivalent to doing this in sentence-transformers:
from sentence_transformers import SentenceTransformer
import torch
sentences = ["This is an example sentence", "Each sentence is converted"]
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}.")
model = SentenceTransformer("sentence-transformers/LaBSE").to(device)
embeddings = model.encode(sentences)
print("default", embeddings)
edited_model = SentenceTransformer("sentence-transformers/LaBSE").to(device)
del edited_model[2]
embeddings = edited_model.encode(sentences)
print("del model[2]:", embeddings)
this gives the output:
default [[ 0.02882483 -0.00602379 -0.05947006 ... -0.03002251 -0.029607
0.00067482]
[-0.05550232 0.02546485 -0.02157257 ... 0.02932104 0.0115004
-0.00848789]]
del model[2]: [[-0.00814162 0.01150823 -0.01516913 ... -0.02249936 0.02313923
-0.02578063]
[ 0.00584357 0.03796612 0.0039336 ... 0.03305857 0.03542801
0.0157448 ]]
where the former corresponds to the results in the post above, and the latter should be similar to the model deployed on Inference Endpoints with TEI.
This is indeed not ideal, and I've notified the maintainers of TEI so they can work on either supporting the Dense module or clearly indicating that this model isn't supported in TEI.
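In the meantime, a quick way to check whether a given model is affected is to inspect its module stack (a small sketch; any model whose stack contains a Dense module will show this mismatch on TEI):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
# Print the module stack; the Dense entry between Pooling and Normalize
# is the part that TEI currently skips.
for idx, module in enumerate(model):
    print(idx, type(module).__name__)
# Expected for LaBSE (matches the `del edited_model[2]` example above):
# 0 Transformer
# 1 Pooling
# 2 Dense
# 3 Normalize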
As a potential solution, when you deploy this model on Inference Endpoints, you can select the “Default” container instead of the TEI one. The default container is a simple wrapper around the sentence transformers library, so it’s not as performant as TEI, but it should give you the correct embeddings.
Hopefully this helps 
Thank you erikkaum, now I understand.
So this feels like a serious bug: an inference service silently ignoring some layers of the model it serves. At the very least, a big warning should be shown.
I am sorry, but to me this is a blocker for adopting your product. It is a nice idea, but not reliable enough for production. I will give it another try in 6 months. In the meantime I will go with Terraform and an autoscalable Docker container. (Not so easy either: I have been working on it for the past couple of days, and autoscaling with cached model weights and enough CPU is not really what it was designed for.)
Hi rpelissier,
I totally understand and agree that it’s a serious bug.
Also, just as a heads up: if you deploy this model on your own infra with the text-embeddings-inference server, you'll hit the same bug.
So when you deploy on your own infra, make sure to use the sentence-transformers implementation so that the embeddings are correct.
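For example, a minimal self-hosted sketch that serves the full pipeline, Dense layer included (FastAPI is just an assumption here, any web framework works):

# Minimal self-hosted serving sketch; FastAPI is an assumption, not a requirement.
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer(
    "sentence-transformers/LaBSE",
    revision="836121a0533e5664b21c7aacc5d22951f2b8b25b",
)

class EmbedRequest(BaseModel):
    inputs: list[str]

@app.post("/embed")
def embed(req: EmbedRequest):
    # encode() runs the full module stack: Transformer, Pooling, Dense, Normalize
    return {"embeddings": model.encode(req.inputs).tolist()}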