Reduce inference latency of text embedding endpoint

Hello, I fine-tuned a custom DistilBERT language model and deployed it as a SageMaker endpoint (p3.2xlarge) for extracting text embeddings. I pass a text snippet as input and the endpoint returns an embedding vector. I am looking to reduce latency.

When I call the endpoint with a single example, the per-sample latency looks very low (< 4 ms of CPU time; the wall time is 61.4 ms):
start_time = time.time()
input = {"inputs": "Google home is much better than alexa",
         "parameters": {"truncation": True}}

CPU times: user 3.69 ms, sys: 0 ns, total: 3.69 ms
Wall time: 61.4 ms

When I call it on a dataframe column with 37k samples, the per-sample latency is very high (~110 ms):
def predict_single(t):
    return predictor.predict({'inputs': t,
                              'parameters': {'truncation': True}})['vectors']

start_time = time.time()
df_test['vectors'] = df_test['text'].apply(predict_single)
latency = round((time.time()-start_time)*1000/len(df_test),3)
print(f"Testing Sample Size: {len(df_test)}\nInference speed: {latency} ms")

Testing Sample Size: 37099
Inference speed: 110.535 ms
CPU times: user 1min 31s, sys: 4.24 s, total: 1min 35s
Wall time: 1h 8min 20s
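
Almost all of the wall time above is spent waiting rather than on local CPU (1min 35s of CPU time against 1h 8min 20s of wall time), so it may help to time each request individually instead of only averaging over the whole column. Below is a minimal sketch of how that could be done; `time_calls` and `summarize` are hypothetical helper names, not part of any SDK, and `fn` stands in for something like `predict_single`.

```python
import statistics
import time


def time_calls(fn, inputs):
    """Call fn on each input and return the wall-clock latency of each call in ms."""
    latencies_ms = []
    for x in inputs:
        t0 = time.perf_counter()
        fn(x)  # assumption: fn sends one request, e.g. predict_single
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    return latencies_ms


def summarize(latencies_ms):
    """Return mean, median and max of a list of latencies."""
    ordered = sorted(latencies_ms)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }
```

Something like `summarize(time_calls(predict_single, df_test['text'].head(500)))` would then show whether a few slow requests pull the mean up or whether every request is slow.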

The length of the text may play a role here. The texts in the dataframe column are much longer in token count. Here is the distribution of token counts (token range: number of samples):
(0, 20): 17896,
(20, 50): 10893,
(50, 80): 3330,
(80, 128): 2286,
(128, 256): 1835,
(256, 512): 587,
(512, 10000): 272
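
For reference, a distribution like the one above could be computed roughly as follows. This is only a sketch: the bucket edges are copied from the ranges above, `bucket_counts` is a hypothetical helper, and treating each range as `[low, high)` is an assumption. The token counts themselves would come from the model's tokenizer (for example the length of `input_ids` for each text), which is not shown here.

```python
from bisect import bisect_right
from collections import Counter

# Bucket edges taken from the ranges in the distribution above.
EDGES = [0, 20, 50, 80, 128, 256, 512, 10000]


def bucket_counts(token_counts, edges=EDGES):
    """Count how many samples fall into each [low, high) bucket."""
    counts = Counter()
    for n in token_counts:
        i = bisect_right(edges, n) - 1
        # Clamp values outside the edges into the first or last bucket.
        i = min(max(i, 0), len(edges) - 2)
        counts[(edges[i], edges[i + 1])] += 1
    return dict(sorted(counts.items()))


# token_counts is a placeholder list; real values would come from the tokenizer.
example = bucket_counts([8, 20, 130, 600])
```

Grouping the measured latencies by these same buckets would show directly whether the long texts are responsible for the high average.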

Any ideas how to optimize this?

Hello @saga0106,

Thanks for opening the thread! To better understand your use case: how long are the samples in the dataframe? Are they significantly longer than the hardcoded example you tested? Are you processing them asynchronously?
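
To illustrate what I mean by processing asynchronously, here is a minimal sketch that sends several requests at once from a thread pool instead of one after another. `predict_concurrently` is a hypothetical helper, `predict_fn` stands in for your `predict_single`, and it assumes the predictor can safely be called from multiple threads and that the endpoint can keep up with `max_workers` parallel requests, neither of which I have verified for your setup.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_concurrently(texts, predict_fn, max_workers=8):
    """Call predict_fn on every text from a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(predict_fn, texts))
```

Usage would look like `df_test['vectors'] = predict_concurrently(df_test['text'].tolist(), predict_single)`. This does not make an individual request faster; it only overlaps the waiting time of many requests, so the gain depends on how much of the ~110 ms is network and queueing overhead.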