Reduce inference latency of text embedding endpoint

Hello, I fine-tuned a custom DistilBERT model and deployed it as a SageMaker endpoint (p3.2xlarge) for extracting text embeddings. I pass a text snippet as input and the endpoint returns an embedding vector. I am looking to improve latency.

When I call the endpoint with a single example, the per-sample latency is very low (< 4 ms of CPU time):
%%time
import time

start_time = time.time()
payload = {"inputs": "Google home is much better than alexa",
           "parameters": {"truncation": True}}
predictor.predict(payload)["vectors"]
round((time.time() - start_time) * 1000, 3)

CPU times: user 3.69 ms, sys: 0 ns, total: 3.69 ms
Wall time: 61.4 ms

When I call it on a dataframe column with 37k samples, the per-sample latency is much higher (~110 ms).
def predict_single(t):
    return predictor.predict({"inputs": t,
                              "parameters": {"truncation": True}})["vectors"]

%%time
start_time = time.time()
df_test["vectors"] = df_test["text"].apply(predict_single)
latency = round((time.time()-start_time)*1000/len(df_test),3)
print(f"Testing Sample Size: {len(df_test)}\nInference speed: {latency} ms")

Testing Sample Size: 37099
Inference speed: 110.535 ms
CPU times: user 1min 31s, sys: 4.24 s, total: 1min 35s
Wall time: 1h 8min 20s

The length of the text may play a role here. The text in the dataframe column is much longer (in token count) than the single example above. Here is the distribution (token-count range: number of samples); a rough sketch of the binning follows the list:
(0, 20): 17896,
(20, 50): 10893,
(50, 80): 3330,
(80, 128): 2286,
(128, 256): 1835,
(256, 512): 587,
(512, 10000): 272
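
For completeness, a minimal sketch of this kind of token-count binning, assuming a standard DistilBERT tokenizer and pandas; treat the checkpoint name and bin edges below as illustrative rather than my exact setup:

import pandas as pd
from transformers import AutoTokenizer

# Illustrative tokenizer checkpoint; swap in the tokenizer the endpoint uses.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def token_count(text):
    # Count tokens without the [CLS]/[SEP] special tokens.
    return len(tokenizer.encode(text, add_special_tokens=False))

bins = [0, 20, 50, 80, 128, 256, 512, 10000]
print(pd.cut(df_test["text"].apply(token_count), bins=bins).value_counts())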

Any ideas how to optimize this?

Hello @saga0106,

Thanks for opening the thread! To better understand your use case: how long are the samples in your dataframe, and are they significantly longer than the hardcoded example you tested with? Are you processing them asynchronously?
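
If the calls are strictly sequential (one request per row, as in your apply loop), each request pays the full round-trip before the next one starts. One thing to try is overlapping requests with a thread pool; here is a rough sketch reusing your predict_single helper, where max_workers is just an illustrative value to tune for your endpoint:

from concurrent.futures import ThreadPoolExecutor

# Send requests concurrently so per-request network/serialization overhead
# overlaps instead of adding up serially.
with ThreadPoolExecutor(max_workers=8) as pool:
    df_test["vectors"] = list(pool.map(predict_single, df_test["text"]))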