Deploying a Sentence Transformer as a SageMaker endpoint

Hello,

Has anyone deployed a Sentence Transformer to SageMaker as an endpoint? The guide Deploy Hugging Face models easily with Amazon SageMaker describes a task-based approach, but that only applies to NLP pipelines and I can't apply it to a Sentence Transformer.

Thanks

1 Like

I’m going through the exact same thing. I’ve found this. It works, and I have a serverless inference endpoint deployed.

I’d prefer to use the sentence-transformer package as I’m not familiar with how to work with AutoTokenizer.
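In case it helps, the all-MiniLM-L6-v2 model card spells out what sentence_transformers does under the hood; a rough sketch of the two side by side (model name taken from that card, values only for illustration):

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

sentences = ["This is an example sentence", "Each sentence is converted"]

# sentence-transformers: one call handles tokenization, pooling and normalization
st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
st_embeddings = st_model.encode(sentences, normalize_embeddings=True)

# plain transformers: the same steps written out by hand
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

# mean pooling over the token embeddings, masked by the attention mask, then L2-normalize
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (output.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
embeddings = F.normalize(embeddings, p=2, dim=1)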

1 Like

Thanks a lot for this nice article. So what do we need to change to make sure this is created as a serverless endpoint?

# deploy the model to an endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge"
)

Yes, by changing the instance configuration you can deploy the model as a serverless endpoint.
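A minimal sketch of what that could look like with the SageMaker Python SDK (the memory size and concurrency values are placeholders; huggingface_model is the model object from the guide):

from sagemaker.serverless import ServerlessInferenceConfig

# placeholder values; size them for your model and expected traffic
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10,
)

# pass the serverless config instead of instance_type / initial_instance_count
predictor = huggingface_model.deploy(
    serverless_inference_config=serverless_config
)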

@philschmid thanks a lot for putting together the guide. I followed the instructions and the SageMaker endpoint is working.

The problem is that I’m seeing very low performance.

  • on the ml.g4dn.xlarge endpoint deployed using the instructions: 3 vectors/second
  • on a SageMaker Studio ml.g4dn.xlarge notebook using CPU: 50 vectors/second
  • on a SageMaker Studio ml.g4dn.xlarge notebook using GPU: 800 vectors/second

Any ideas why performance is so much lower on the endpoint?

Did you use a custom inference.py? Can you please share the code for how you deployed the model?

Don’t forget there is network overhead when you send requests from a notebook environment to your endpoint, but a 20x/250x difference sounds way too large.

Thanks for the quick follow-up @philschmid.

I copy/pasted the code from the article you wrote, but here it is in case you notice something. I ran it on a SageMaker Studio notebook.

1 Like

You are not moving your model to the GPU.

Can you clarify what I’m supposed to do differently?

Thank you @philschmid, I’m making some progress.

I managed to enable the GPU on the endpoint by setting the device explicitly, and I’m now seeing roughly 100x the performance, at around 300 embeddings/second. :smiley:
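Roughly, the change was moving the model and the tokenized inputs onto the GPU; a sketch of the idea (not the exact endpoint code, and the model name is just illustrative):

import torch
from transformers import AutoModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# move the model to the GPU once, when it is loaded
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2").to(device)
model.eval()

# move the tokenized inputs to the same device before the forward pass
encoded = tokenizer(["an example sentence"], padding=True, truncation=True, return_tensors="pt")
encoded = {k: v.to(device) for k, v in encoded.items()}
with torch.no_grad():
    output = model(**encoded)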

When using sentence_transformers on the same GPU hardware, I get about 750 embeddings/second.

Any ideas how I can speed up the custom inference script to perform like sentence_transformers.encode()?

Could the difference come from network overhead, since you are sending requests to a server, or are you measuring locally? What GPU are you using locally?

Hi Phil, no it’s not network overhead because I’m running this code locally. By locally I mean on a SageMaker Studio notebook with a Tesla P4 GPU.

I put together a notebook that benchmarks sentence_transformers vs using transformers as described in the notebook you put together.

I think I narrowed down the performance issue. Generating embeddings on GPU is very fast but calling tolist() on the tensors is taking a lot of time.

It’s clear that sentence_transformers has been optimized for performance. I looked through its code but could not pinpoint what makes it so much faster.

Please have a look at the notebook and let me know what you think:

Hey just following this because I am doing something similar. I actually started with this guide: Building AI-powered search in PostgreSQL using Amazon SageMaker and pgvector | AWS Database Blog. That post demonstrates setting HF_TASK to “feature-extraction” for embeddings, but it also does cls_pooling instead of mean_pooling for reasons I don’t understand.
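For context, the two pooling strategies only differ in how the per-token embeddings are collapsed into one vector per sentence; a sketch of both (the all-MiniLM-L6-v2 card uses mean pooling for that model):

import torch

def cls_pooling(model_output):
    # take the embedding of the first ([CLS]) token of each sentence
    return model_output.last_hidden_state[:, 0]

def mean_pooling(model_output, attention_mask):
    # average the token embeddings, ignoring padding via the attention mask
    token_embeddings = model_output.last_hidden_state
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)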

@remigabillet I think what @philschmid might have been referring to is that if you are using a custom inference script, the prediction endpoint is deployed to another server, which would account for the networking delay. Of course I’m new to this so I could be wrong, but I am also very interested in running embeddings at scale on very large datasets.

Ok, I see that even when you run it all locally you are concerned about the tolist() time, but why call tolist() at all? Why not just keep it as a numpy array?

thanks for following up @bryanjj. I’m calling tolist() because the object is a torch tensor and I need to save the vector in a separate database. I don’t know of another way to extract the data. :person_shrugging:

Hi @remigabillet, I’m not sure if you are still looking for help with tolist(), but this is for anyone coming across this thread in the future. I believe that converting the tensor directly to a list with tolist() is indeed a very slow operation. What could be tested is converting the tensor to a numpy array before converting it to a list. Potentially the issue can also arise from having to move the tensor from the GPU to the CPU.

Personally I have used the following approach: tensor.cpu().numpy().tolist()
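A small sketch of the two paths, assuming sentence_embeddings is a CUDA tensor as in the inference script discussed above:

# assuming sentence_embeddings is a tensor sitting on the GPU

# direct conversion from a CUDA tensor; this was the slow path reported above
vectors = sentence_embeddings.tolist()

# copy to host memory once, then convert through numpy
vectors = sentence_embeddings.cpu().numpy().tolist()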

Thanks for the follow-up @MitchKlaver. I ended up using a different approach and avoiding Sagemaker altogether. If I try again this way, I’ll refer to your post.

The implementation in this part works for a single sentence. If you want to take a list of strings as input, as on the official model page (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), you need to make changes to the inference.py file.

Updated predict_fn in inference.py

import ast

import torch
import torch.nn.functional as F

# mean_pooling is the helper defined earlier in inference.py, as in the original guide


def predict_fn(input_data, model_and_tokenizer):

    # extract model and tokenizer
    model, tokenizer = model_and_tokenizer

    #! The list is loaded as a string by AWS SageMaker and needs to be converted back to a list
    sentences = ast.literal_eval(input_data['inputs'])

    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings without gradient tracking
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool and normalize the token embeddings
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

    #! Originally only the first sentence_embedding was returned; now return the entire tensor
    return {"vectors": sentence_embeddings.tolist()}

After following the original post's link, you can make inference requests using the format below:

import json
import boto3


input_str = ["This is an example sentence", "Each sentence is converted"]
input_data = json.dumps({"inputs": f"{input_str}"})

endpoint_name = "YOUR-ENDPOINT-NAME-IN-SAGEMAKER"
sagemaker_client = boto3.client('sagemaker-runtime', region_name="REGION-OF-SAGEMAKER-ENDPOINT")


response = sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=input_data,
    ContentType='application/json',
    Accept='application/json'
)
response_json = json.loads(response['Body'].read().decode())

# the embeddings are returned under the "vectors" key
vectors = response_json["vectors"]