Deploying a Sentence Transformer as a SageMaker endpoint

Hello,

Has anyone deployed a Sentence Transformer to SageMaker as an endpoint? The guide Deploy Hugging Face models easily with Amazon SageMaker describes a task-based approach, but that only applies to NLP pipelines and I can't apply it to a Sentence Transformer.

Thanks

1 Like

I’m going through the exact same thing. I’ve found this. It works, and I have a serverless inference endpoint deployed.

I’d prefer to use the sentence-transformer package as I’m not familiar with how to work with AutoTokenizer.
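In case it helps, the all-MiniLM-L6-v2 model card spells out what sentence_transformers does under the hood; a rough sketch of the two side by side (model name taken from that card, values only for illustration):

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

sentences = ["This is an example sentence", "Each sentence is converted"]

# sentence-transformers: one call handles tokenization, pooling and normalization
st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
st_embeddings = st_model.encode(sentences, normalize_embeddings=True)

# plain transformers: the same steps written out by hand
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

# mean pooling over the token embeddings, masked by the attention mask, then L2-normalize
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (output.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
embeddings = F.normalize(embeddings, p=2, dim=1)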

1 Like

Thanks a lot for this nice article. So what do we need to change to make sure this is created as a serverless endpoint?

# deploy the model to an endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge"
)

Yes, by changing the instance configuration you can deploy the model as a serverless endpoint.
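A minimal sketch of what that could look like with the SageMaker Python SDK (the memory size and concurrency values are placeholders; huggingface_model is the model object from the guide):

from sagemaker.serverless import ServerlessInferenceConfig

# placeholder values; size them for your model and expected traffic
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10,
)

# pass the serverless config instead of instance_type / initial_instance_count
predictor = huggingface_model.deploy(
    serverless_inference_config=serverless_config
)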

@philschmid thanks a lot for putting together the guide. I followed the instructions and the SageMaker endpoint is working.

The problem is that I’m seeing very low performance.

  • on the ml.g4dn.xlarge endpoint deployed using the instructions: 3 vectors/second
  • on a SageMaker Studio ml.g4dn.xlarge notebook using CPU: 50 vectors/second
  • on a SageMaker Studio ml.g4dn.xlarge notebook using GPU: 800 vectors/second

Any ideas why performance is so much lower on the endpoint?

Did you use a custom inference.py? Can you please share the code for how you deployed the model?

Don’t forget there is network overhead when you send requests from a notebook environment to your endpoint, but a 20x/250x difference sounds way too large.

Thanks for the quick follow-up @philschmid.

I copy/pasted the code from the article you wrote, but here it is in case you notice something. I ran it on a SageMaker Studio notebook.

1 Like

You are not moving your model to the GPU.

Can you clarify what I’m supposed to do differently?

Thank you @philschmid, I’m making some progress.

I managed to enable the GPU on the endpoint by setting the device explicitly, and I’m now seeing roughly 100x the performance, at around 300 embeddings/second. :smiley:
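Roughly, the change was moving the model and the tokenized inputs onto the GPU; a sketch of the idea (not the exact endpoint code, and the model name is just illustrative):

import torch
from transformers import AutoModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# move the model to the GPU once, when it is loaded
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2").to(device)
model.eval()

# move the tokenized inputs to the same device before the forward pass
encoded = tokenizer(["an example sentence"], padding=True, truncation=True, return_tensors="pt")
encoded = {k: v.to(device) for k, v in encoded.items()}
with torch.no_grad():
    output = model(**encoded)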

When using sentence_transformers on the same GPU hardware, I get about 750 embeddings/second.

Any ideas how I can speed up the custom inference script to perform like sentence_transformers.encode()?

Could the difference come from network overhead, since you are sending requests to a server, or are you measuring locally? What GPU are you using locally?

Hi Phil, no it’s not network overhead because I’m running this code locally. By locally I mean on a SageMaker Studio notebook with a Tesla P4 GPU.

I put together a notebook that benchmarks sentence_transformers vs using transformers as described in the notebook you put together.

I think I narrowed down the performance issue. Generating embeddings on GPU is very fast but calling tolist() on the tensors is taking a lot of time.

It’s clear that sentence_transformers has been optimized for performance. I looked through its code but could not pinpoint what makes it so much faster.

Please have a look at the notebook and let me know what you think:

Hey just following this because I am doing something similar. I actually started with this guide: Building AI-powered search in PostgreSQL using Amazon SageMaker and pgvector | AWS Database Blog. That post demonstrates setting HF_TASK to “feature-extraction” for embeddings, but it also does cls_pooling instead of mean_pooling for reasons I don’t understand.
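For context, the two pooling strategies only differ in how the per-token embeddings are collapsed into one vector per sentence; a sketch of both (the all-MiniLM-L6-v2 card uses mean pooling for that model):

import torch

def cls_pooling(model_output):
    # take the embedding of the first ([CLS]) token of each sentence
    return model_output.last_hidden_state[:, 0]

def mean_pooling(model_output, attention_mask):
    # average the token embeddings, ignoring padding via the attention mask
    token_embeddings = model_output.last_hidden_state
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)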

@remigabillet I think what @philschmid might have been referring to is that if you are using a custom inference script, the prediction endpoint is deployed to another server, which would account for the networking delay. Of course I’m new to this so I could be wrong, but I am also very interested in running embeddings at scale on very large datasets.

Ok, I see that even when you run it all locally you are concerned about the tolist() time, but why call tolist() at all? Why not just keep it as a numpy array?

thanks for following up @bryanjj. I’m calling tolist() because the object is a torch tensor and I need to save the vector in a separate database. I don’t know of another way to extract the data. :person_shrugging:

Hi @remigabillet, I’m not sure if you are still looking for help with tolist(), but this is for anyone coming across this thread in the future. I believe that converting the tensor directly to a list with tolist() is indeed a very slow operation. What could be tested is converting the tensor to a numpy array before converting it to a list. Potentially the issue can also arise from having to move the tensor from the GPU to the CPU.

Personally I have used the following approach: tensor.cpu().numpy().tolist()
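A small sketch of the two paths, assuming sentence_embeddings is a CUDA tensor as in the inference script discussed above:

# assuming sentence_embeddings is a tensor sitting on the GPU

# direct conversion from a CUDA tensor; this was the slow path reported above
vectors = sentence_embeddings.tolist()

# copy to host memory once, then convert through numpy
vectors = sentence_embeddings.cpu().numpy().tolist()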

Thanks for the follow-up @MitchKlaver. I ended up using a different approach and avoiding Sagemaker altogether. If I try again this way, I’ll refer to your post.

The implementation in this part works for a single sentence. If you want to take a list of strings as input, as on the official model page (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), you need to make changes to the inference.py file.

Updated predict_fn in inference.py

import ast

import torch
import torch.nn.functional as F

# mean_pooling is the helper defined earlier in inference.py, as in the original guide


def predict_fn(input_data, model_and_tokenizer):

    # extract model and tokenizer
    model, tokenizer = model_and_tokenizer

    #! The list is loaded as a string by AWS SageMaker and needs to be converted back to a list
    sentences = ast.literal_eval(input_data['inputs'])

    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings without gradient tracking
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool and normalize the token embeddings
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

    #! Originally only the first sentence_embedding was returned; now return the entire tensor
    return {"vectors": sentence_embeddings.tolist()}

After following the original post's link, you can make inference requests using the format below:

import json
import boto3


input_str = ["This is an example sentence", "Each sentence is converted"]
input_data = json.dumps({"inputs": f"{input_str}"})

endpoint_name = "YOUR-ENDPOINT-NAME-IN-SAGEMAKER"
sagemaker_client = boto3.client('sagemaker-runtime', region_name="REGION-OF-SAGEMAKER-ENDPOINT")


response = sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=input_data,
    ContentType='application/json',
    Accept='application/json'
)
response_json = json.loads(response['Body'].read().decode())

# the embeddings are returned under the "vectors" key
vectors = response_json["vectors"]