Has anyone deployed a sentence transformer to SageMaker as an endpoint? The linked post, Deploy Hugging Face models easily with Amazon SageMaker, describes a task-based approach, but that only applies to NLP pipelines and I can't apply it to a sentence transformer.
Could the difference come from network overhead, since you are sending requests to a server? Or how are you measuring it? What GPU are you using locally?
Hi Phil, no it’s not network overhead because I’m running this code locally. By locally I mean on a SageMaker Studio notebook with a Tesla P4 GPU.
I put together a notebook that benchmarks sentence_transformers against the plain transformers approach described in your notebook.
I think I narrowed down the performance issue: generating embeddings on the GPU is very fast, but calling tolist() on the resulting tensors takes a lot of time.
It's clear that sentence_transformers has been optimized for performance, but I looked through the code and could not pinpoint what makes it so much faster.
Please have a look at the notebook and let me know what you think:
@remigabillet I think what @philschmid might have been referring to is that if you are using a custom inference script, the prediction endpoint is deployed to another server, which would account for network delay. Of course, I'm new to this, so I could be wrong, but I am also very interested in running embeddings at scale on very large datasets.
thanks for following up @bryanjj. I'm calling tolist() because the object is a torch tensor and I need to save the vectors in a separate database. I don't know of another way to extract the data.
Hi @remigabillet, I'm not sure if you are still looking for help with tolist(), but for anyone coming across this thread in the future: converting the tensor directly to a list with tolist() is indeed a very slow operation. What could be tested is converting the tensor to a numpy array before converting it to a list. Part of the cost can also come from having to move the tensor from the GPU to the CPU.
Personally I have used the following approach: tensor.cpu().numpy().tolist()
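For anyone who wants to verify this, here is a minimal sketch of the comparison (the tensor shape and variable names are illustrative, not from the thread):

import time

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
emb = torch.randn(1000, 384, device=device)  # e.g. a batch of 1000 all-MiniLM-L6-v2 embeddings

if device == "cuda":
    torch.cuda.synchronize()  # make sure the GPU work is done before timing

start = time.perf_counter()
slow = emb.tolist()  # converting the tensor directly; noticeably slow on large tensors
print(f"tolist():               {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
fast = emb.cpu().numpy().tolist()  # one bulk device-to-host copy, then a fast numpy conversion
print(f"cpu().numpy().tolist(): {time.perf_counter() - start:.3f}s")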
Thanks for the follow-up @MitchKlaver. I ended up using a different approach and avoiding SageMaker altogether. If I try again this way, I'll refer to your post.
The implementation in this part is good for a single sentence. If you want to take a list of strings as input, as on the official model page (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), you need to make changes to the inference.py file.
Updated predict_fn in inference.py
import ast

import torch
import torch.nn.functional as F

def predict_fn(input_data, model_and_tokenizer):
    # extract model and tokenizer
    model, tokenizer = model_and_tokenizer
    #! The list arrives as a string via AWS SageMaker and needs to be converted back to a list
    sentences = ast.literal_eval(input_data['inputs'])
    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings without gradient tracking
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Pool and normalize the token embeddings
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    #! Originally only the first sentence_embedding was returned; now return the entire batch
    return {"vectors": sentence_embeddings.tolist()}
After following the original post's link, you can run inference using the format below:
import json

import boto3

input_str = ["This is an example sentence", "Each sentence is converted"]
input_data = json.dumps(
    {"inputs": f"{input_str}"}  # stringify the list so predict_fn can literal_eval it back
)

endpoint_name = "YOUR-ENDPOINT-NAME-IN-SAGEMAKER"
sagemaker_client = boto3.client('sagemaker-runtime', region_name="REGION-OF-SAGEMAKER-ENDPOINT")

response = sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=input_data,
    ContentType='application/json',
    Accept='application/json'
)
response_json = json.loads(response['Body'].read().decode())
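The response then contains one embedding per input sentence under the "vectors" key, e.g. (variable names here are just for illustration):

vectors = response_json["vectors"]
print(len(vectors))     # 2, one embedding per input sentence
print(len(vectors[0]))  # 384 dimensions for all-MiniLM-L6-v2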