Hello,
I am trying to run the embedding model "sentence-transformers/LaBSE" with revision="836121a0533e5664b21c7aacc5d22951f2b8b25b" on Inference Endpoints.
I get a result, but the embedding values differ from the local execution, and they are not even correlated when compared with cosine similarity.
Any idea what's going on?
from abc import ABC, abstractmethod
import numpy as np
import requests
from sentence_transformers import SentenceTransformer
from sbw_fiabilis.logger import get_logger, set_level
import os
from dotenv import load_dotenv

logger = get_logger()


class EmbeddingInterface(ABC):
    """Abstract interface for embedding services."""

    @abstractmethod
    def encode(self, texts, batch_size=None, show_progress_bar=False):
        pass


class LocalEmbeddingService(EmbeddingInterface):
    """Local implementation using SentenceTransformer."""

    def __init__(self):
        WORKING_DIR = os.getenv("WORKING_DIR", os.path.join(os.path.dirname(__file__), "../../data/working_dir"))
        HF_HOME = os.path.join(WORKING_DIR, ".hf")
        os.environ["HF_HOME"] = HF_HOME
        self.model = SentenceTransformer(
            "sentence-transformers/LaBSE",
            revision="836121a0533e5664b21c7aacc5d22951f2b8b25b",
            cache_folder=HF_HOME,
        )
        logger.info("LocalEmbeddingService configured")

    def encode(self, texts, batch_size=32, show_progress_bar=False):
        return self.model.encode(texts, batch_size=batch_size, show_progress_bar=show_progress_bar)


class APIEmbeddingService(EmbeddingInterface):
    """Implementation using the Hugging Face Inference Endpoint API."""

    def __init__(self):
        self.api_url = os.getenv("EMBEDDING_API_URL")
        self.api_key = os.getenv("EMBEDDING_API_KEY")
        if not self.api_url or not self.api_key:
            raise ValueError("EMBEDDING_API_URL and EMBEDDING_API_KEY must be set")
        self.headers = {
            "Accept": "application/json",
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        logger.info("APIEmbeddingService configured")

    def _query_api(self, payload):
        try:
            response = requests.post(self.api_url, headers=self.headers, json=payload, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            logger.error(f"Error during API request: {e}")
            raise

    def encode(self, texts, batch_size=32, show_progress_bar=False):
        if not texts:
            return np.array([])
        all_embeddings = []
        total_texts = len(texts)
        logger.info(f"Encoding via API: {total_texts} texts in batches of {batch_size}")
        for i in range(0, total_texts, batch_size):
            batch = texts[i:i + batch_size]
            payload = {
                "inputs": batch,
                "parameters": {},
            }
            response = self._query_api(payload)
            # Handle the different API response formats
            if isinstance(response, list):
                batch_embeddings = response
            elif isinstance(response, dict) and "embeddings" in response:
                batch_embeddings = response["embeddings"]
            else:
                raise ValueError(f"Unexpected API response format: {type(response)}")
            all_embeddings.extend(batch_embeddings)
            logger.info(f"  Batch processed: {min(i + batch_size, total_texts)}/{total_texts}")
        return all_embeddings


def test():
    logger = get_logger()
    set_level("DEBUG")
    load_dotenv()
    texts = ["toto", "tata"]

    service = LocalEmbeddingService()
    embeddings = service.encode(texts)
    logger.info(embeddings[0][:5])
    logger.info(embeddings[1][:5])

    service = APIEmbeddingService()
    embeddings = service.encode(texts)
    logger.info(embeddings[0][:5])
    logger.info(embeddings[1][:5])


if __name__ == "__main__":
    test()
Here are the results, showing the different embeddings:
INFO - Logger level set to INFO
INFO - Logger level set to DEBUG
INFO - LocalEmbeddingService configured
INFO - [ 0.02300638 -0.07002795 -0.01850945 -0.03634194 0.0507826 ]
INFO - [-0.03088209 -0.05037568 -0.00730146 -0.0068823 0.03126564]
INFO - APIEmbeddingService configured
INFO - Encoding via API: 2 texts in batches of 32
INFO -  Batch processed: 2/2
INFO - [0.0077932924, 0.015989138, 0.010355308, 0.0026318827, 0.019499298]
INFO - [-0.007399403, -0.03194063, -0.016836794, 0.022840464, 0.001694431]
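To quantify the mismatch, here is a minimal sketch of the cosine-similarity check I ran (local_embeddings / api_embeddings are just placeholders for the two outputs above):

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# local_embeddings / api_embeddings stand in for the outputs of the two services above
for local_vec, api_vec in zip(local_embeddings, api_embeddings):
    print(cosine(local_vec, api_vec))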
If you select anything other than "Custom," I think the contents of handler.py will be ignored. In that case, the model is probably executed with the default arguments of the default pipeline, which may be why the results differ from your local code.
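If you do try the "Custom" option, a handler.py along these lines should reproduce the local sentence-transformers behaviour (a rough sketch of the EndpointHandler interface used by custom handlers; the exact response shape is up to you):

# handler.py -- rough sketch of a custom Inference Endpoints handler
from sentence_transformers import SentenceTransformer

class EndpointHandler:
    def __init__(self, path=""):
        # `path` is the local path of the repository snapshot inside the container
        self.model = SentenceTransformer(path)

    def __call__(self, data):
        inputs = data.get("inputs", [])
        if isinstance(inputs, str):
            inputs = [inputs]
        # Full sentence-transformers pipeline, same as running locally
        return {"embeddings": self.model.encode(inputs).tolist()}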
Thank you John for helping.
I am not running the endpoint that way; I am using the no-code approach, and the UI shows the right model with the right revision (see screenshots).
This means that either different libraries are being used locally and on the endpoint (in this case, SentenceTransformers locally vs. TEI on the endpoint), or the code for the template is simply buggy…
If the repository revision specification is not honored, that may also be a bug, but if that were the only issue, the cosine similarity should not be this far off.
As shown below, a fairly old version of the library is used on the endpoint. Of course, it is possible to update it manually…
Indeed, the replica's log doesn't really seem to take into account any of the parameters provided in the UI.
The log of the replica:
Args { model_id: "/rep****ory", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "r-rpelissier-sbw-fidi-labse-58w96y74-e4770-0t00y", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/repository/cache"), payload_limit: 2000000, api_key: None, json_output: true, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
Too bad if I need to debug this myself (for a paid service).
The point of a managed service is to hide the underlying complexity of provisioning, maintaining versions, and so on. I am really disappointed by what looks like a tool for POCs rather than a production-ready service.
And having a mailto:… link (which tries to open my desktop mail app instead of Gmail) as the only way to reach support is another sign that this is not very serious.
If it’s for a paid service, using Expert Support is probably the fastest and most reliable option, especially if it seems like a bug.
BTW, on my local PC:
from sentence_transformers import SentenceTransformer # sentence-transformers 4.0.1
import torch
sentences = ["This is an example sentence", "Each sentence is converted"]
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}.") # Running on cuda.
model = SentenceTransformer("sentence-transformers/LaBSE").to(device)
embeddings = model.encode(sentences)
print("main:", embeddings)
#main: [[ 0.02882478 -0.00602382 -0.05947006 ... -0.03002249 -0.029607
# 0.00067482]
# [-0.05550233 0.02546483 -0.02157256 ... 0.02932105 0.01150041
# -0.00848792]]
model = SentenceTransformer("sentence-transformers/LaBSE", revision="836121a0533e5664b21c7aacc5d22951f2b8b25b").to(device)
embeddings = model.encode(sentences)
print("836121a0533e5664b21c7aacc5d22951f2b8b25b:", embeddings)
#836121a0533e5664b21c7aacc5d22951f2b8b25b: [[ 0.02882478 -0.00602382 -0.05947006 ... -0.03002249 -0.029607
# 0.00067482]
# [-0.05550233 0.02546483 -0.02157256 ... 0.02932105 0.01150041
# -0.00848792]]
model.to("cpu")
embeddings = model.encode(sentences)
print("On CPU:", embeddings)
#On CPU: [[ 0.02882476 -0.00602385 -0.05947007 ... -0.03002251 -0.02960699
# 0.00067482]
# [-0.05550234 0.02546484 -0.02157255 ... 0.02932107 0.01150037
# -0.00848786]]
At least it is locally consistent. Thank you!
Hi rpelissier 
Sorry about the hassle here. I did a deep dive on the issue and I think I know what's going on: the model deployed in your inference endpoint uses the TEI server engine, whereas the local example uses sentence-transformers, and unfortunately there's a mismatch between the two implementations. This model is one of the few that uses a Dense module, which is supported in sentence-transformers but not in TEI.
So when the model is run with TEI (and therefore on Inference Endpoints), it's equivalent to doing this in sentence-transformers:
from sentence_transformers import SentenceTransformer
import torch
sentences = ["This is an example sentence", "Each sentence is converted"]
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}.")
model = SentenceTransformer("sentence-transformers/LaBSE").to(device)
embeddings = model.encode(sentences)
print("default", embeddings)
edited_model = SentenceTransformer("sentence-transformers/LaBSE").to(device)
del edited_model[2]
embeddings = edited_model.encode(sentences)
print("del model[2]:", embeddings)
this gives the output:
default [[ 0.02882483 -0.00602379 -0.05947006 ... -0.03002251 -0.029607
0.00067482]
[-0.05550232 0.02546485 -0.02157257 ... 0.02932104 0.0115004
-0.00848789]]
del model[2]: [[-0.00814162 0.01150823 -0.01516913 ... -0.02249936 0.02313923
-0.02578063]
[ 0.00584357 0.03796612 0.0039336 ... 0.03305857 0.03542801
0.0157448 ]]
where the former corresponds to the results in the post above, and the latter should be similar to the model deployed on Inference Endpoints with TEI.
This is indeed not ideal, and I've notified the maintainers of TEI so they can work on either supporting the Dense module or clearly indicating that this model isn't supported in TEI.
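In the meantime, a quick way to check whether a given model is affected is to inspect its module stack (a small sketch; any model whose stack contains a Dense module will show this mismatch on TEI):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
# Print the module stack; the Dense entry between Pooling and Normalize
# is the part that TEI currently skips.
for idx, module in enumerate(model):
    print(idx, type(module).__name__)
# Expected for LaBSE (matches the `del edited_model[2]` example above):
# 0 Transformer
# 1 Pooling
# 2 Dense
# 3 Normalize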
As a potential solution, when you deploy this model on Inference Endpoints, you can select the “Default” container instead of the TEI one. The default container is a simple wrapper around the sentence transformers library, so it’s not as performant as TEI, but it should give you the correct embeddings.
Hopefully this helps 
Thank you erikkaum, now I understand.
So this feels like a serious bug: an inference service silently ignoring some layers of the model it serves. At the very least, a big warning should be shown.
I am sorry, but to me this is a blocker for adopting your product. It is a nice idea, but not reliable enough for production. I will give it another try in 6 months. In the meantime I will go with Terraform and an autoscalable Docker container. (Not so easy either: I have been working on it for the past couple of days, and autoscaling with cached model weights and enough CPU is not really what it was designed for.)
Hi rpelissier,
I totally understand and agree that it’s a serious bug.
Also, just as a heads up: if you deploy this model on your own infra with the text-embeddings-inference server, you'll hit the same bug.
So when you deploy on your own infra, make sure to use the sentence-transformers implementation so that the embeddings are correct.
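For example, a minimal self-hosted sketch that serves the full pipeline, Dense layer included (FastAPI is just an assumption here, any web framework works):

# Minimal self-hosted serving sketch; FastAPI is an assumption, not a requirement.
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer(
    "sentence-transformers/LaBSE",
    revision="836121a0533e5664b21c7aacc5d22951f2b8b25b",
)

class EmbedRequest(BaseModel):
    inputs: list[str]

@app.post("/embed")
def embed(req: EmbedRequest):
    # encode() runs the full module stack: Transformer, Pooling, Dense, Normalize
    return {"embeddings": model.encode(req.inputs).tolist()}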