Unable to Process Concurrent User Requests

I am using https://huggingface.co/ktrapeznikov/biobert_v1.1_pubmed_squad_v2/tree/main# for question answering. I send around 50 abstracts from PubMed as context and then ask a question. It works fine for a single user, but when I scale to 10 concurrent users the model takes too long to respond. Can anybody help?

There are a number of avenues you could explore to reduce inference time:

  1. Scale your deployment vertically (a bigger machine or a GPU) or horizontally (more model replicas); see the serving sketch after this list.
  2. Move to a smaller model, accepting some loss of biomedical domain accuracy.
  3. Improve your preprocessing/postprocessing efficiency, for example by batching all 50 abstracts through the pipeline in one call instead of looping over them; see the batching sketch below.
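For the batching side, here is a minimal sketch. It assumes (since the post doesn't show any code) that you are currently calling the model once per abstract; passing all the abstracts in a single pipeline call lets the model process them in batches. The function name, `batch_size`, and `device=0` are illustrative choices, not anything from the original post.

```python
from transformers import pipeline

# Assumed setup: one pipeline instance scoring ~50 PubMed abstracts
# against a single question. device=0 uses the first GPU; remove it to stay on CPU.
qa = pipeline(
    "question-answering",
    model="ktrapeznikov/biobert_v1.1_pubmed_squad_v2",
    device=0,
)

def answer(question, abstracts, batch_size=16):
    # Pass all (question, abstract) pairs in one call so the pipeline can
    # batch them, instead of invoking the model once per abstract.
    results = qa(
        question=[question] * len(abstracts),
        context=abstracts,
        batch_size=batch_size,  # batching needs a reasonably recent transformers release
    )
    # Keep the highest-scoring span across all abstracts.
    return max(results, key=lambda r: r["score"])
```

If you also swap in a smaller checkpoint here (for example `distilbert-base-cased-distilled-squad`), per-request latency drops further, but that model was not trained on PubMed text, so expect some loss of biomedical accuracy.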
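For scaling out, the serving sketch below assumes FastAPI as the serving layer (the post doesn't say what you are actually using). Each worker process loads its own copy of the model at startup, so concurrent users are spread across replicas instead of queuing behind a single model instance.

```python
# app.py - minimal sketch; FastAPI and the /answer route are assumptions.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Each worker process gets its own model copy, so requests handled
# by different workers run in parallel.
qa = pipeline(
    "question-answering",
    model="ktrapeznikov/biobert_v1.1_pubmed_squad_v2",
)

class QARequest(BaseModel):
    question: str
    abstracts: List[str]

@app.post("/answer")
def answer(req: QARequest):
    results = qa(question=[req.question] * len(req.abstracts),
                 context=req.abstracts)
    best = max(results, key=lambda r: r["score"])
    return {"answer": best["answer"], "score": best["score"]}
```

Running this with several workers, e.g. `gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app`, gives you four replicas on one machine; putting the same service behind a load balancer on multiple machines is the horizontal version of the same idea.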