I am trying to build a QA model that receives a question from the front end and passes the answers back. The latency is currently 95 seconds, which is far too high for an online application.
Internally, my backend code fetches up to 100 contexts related to the user's question from a database and extracts answers from them. I then take the top 20 answers based on business logic.
My current logic loops through each context:

loaded_mdl = torch.load("/location/finetuned_model.bin")

def qamodel(question, context):
    to_predict = [{"question": question, "context": context}]
    answers, _ = loaded_mdl.predict(to_predict)
    return answers

for context in all_contexts:
    answer = qamodel(question, context)
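For reference, here is a minimal runnable sketch of that per-context loop. A stub class stands in for the real fine-tuned model (only its .predict() call shape mirrors my code; the stub itself and the answer strings are placeholders):

```python
from typing import List, Tuple

class StubQAModel:
    """Hypothetical stand-in for the model loaded via torch.load();
    the real object exposes a .predict() method of the same shape."""

    def predict(self, to_predict: List[dict]) -> Tuple[List[str], list]:
        # Return one dummy answer per {"question", "context"} item.
        return ["answer for: " + item["context"] for item in to_predict], []

def answer_all(question: str, all_contexts: List[str]) -> List[str]:
    loaded_mdl = StubQAModel()  # loaded once, outside the per-context loop
    answers = []
    for context in all_contexts:
        preds, _ = loaded_mdl.predict([{"question": question, "context": context}])
        answers.extend(preds)
    return answers
```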
This QA model alone accounts for 90 of the total 95 seconds of latency.
I tried calling the QA model from threads so the contexts could be processed in parallel, thereby reducing the latency. But the latency increased from 90 seconds to 240 seconds.
The code I tried:

def parallel_qa(context):
    answer = qamodel(question, context)
    return answer

with ThreadPoolExecutor(max_workers=20) as exe:
    print("calling ThreadPoolExecutor")
    result = exe.map(parallel_qa, all_contexts)
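Written out as a self-contained runnable sketch (with a hypothetical stub in place of the real model call, so the snippet runs on its own), the threading attempt looks roughly like this:

```python
from concurrent.futures import ThreadPoolExecutor

def qamodel(question, context):
    # Hypothetical stand-in for the real per-context model call.
    return "answer for: " + context

question = "example question"
all_contexts = ["ctx one", "ctx two", "ctx three"]

def parallel_qa(context):
    return qamodel(question, context)

with ThreadPoolExecutor(max_workers=20) as exe:
    # Executor.map returns a lazy iterator; materialize it inside the
    # with-block so all work finishes before the executor shuts down.
    results = list(exe.map(parallel_qa, all_contexts))
```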
How can I use multiprocessing to reduce the latency?
Is there any other recommended approach to reducing latency?
Any help appreciated. Thanks.