(Tips) Optimizing Underutilized Resources

Hi there, I’d like to know your opinion on how to optimize the following setup.

This dedicated endpoint uses one NVIDIA Tesla T4 (16 GB) and serves sentence-transformers/clip-ViT-B-32-multilingual-v1 to produce sentence embeddings. Each request I send to the endpoint contains a chunk of 10,000 sentences.

The issue is that GPU usage appears to be fully allocated, while the other resources seem underutilized. Do you have any advice on optimization? For example, async querying, increasing the batch size beyond 10k, or other ideas.
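To make the async querying idea concrete, here is a minimal sketch of what I have in mind: split the sentences into smaller chunks and keep several requests in flight at once. The names `split_into_chunks`, `embed_concurrently`, and `send_chunk` are hypothetical; `send_chunk` stands in for whatever function actually posts one chunk to the endpoint and returns its embeddings, and the chunk size and worker count are guesses that would need tuning. If the GPU really is saturated, this may not raise throughput at all.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Vector = List[float]


def split_into_chunks(sentences: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split the input into consecutive chunks of at most chunk_size sentences."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(sentences[i:i + chunk_size])
            for i in range(0, len(sentences), chunk_size)]


def embed_concurrently(
    sentences: Sequence[str],
    send_chunk: Callable[[List[str]], List[Vector]],  # hypothetical: one request per chunk
    chunk_size: int = 1000,   # assumption: smaller than the current 10k per request
    max_workers: int = 4,     # assumption: number of requests kept in flight
) -> List[Vector]:
    """Send chunks in parallel threads and return embeddings in input order."""
    chunks = split_into_chunks(sentences, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input chunks,
        # even if the requests finish out of order.
        results = pool.map(send_chunk, chunks)
        return [vector for chunk_vectors in results for vector in chunk_vectors]
```

Threads are used here because each request spends most of its time waiting on the network, so the client side should not be the limiting factor.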

Cheers