Execute finetuned QA model in parallel

Madhanlal · September 29, 2022, 2:25am

Hi
I am trying to build a QA model which gets a question from front end and pass it backs the answers. The latency time is 95 seconds right now which is very high for the online application.

Internally, my backend code fetches a maximum of 100 contexts related to the user’s question from a database and fetches the answers. I would take top 20 best answers based on business logic.

My current logic is looping thru each context:

answers =
for context in all_context:
answer = call qamodel (question,context)
answers.append(answer)

loaded_mdl = torch.load(“/location/finetuned_model.bin”)
def qamodel(question,context):
answers, _ = loaded_mdl.predict(to_predict)
return answers[0]

The latency of this QA model alone is 90 seconds out of total 95 seconds.
I tried to call this qamodel in threads so parallel processing can occur there by reducing the latency time.
But the latency time increased from 90 seconds to 240 seconds.

The code I tried:
def parallel_qa(context):
answer = qamodel(context)
return [answer]

    result = []
    with ThreadPoolExecutor(max_workers=20) as exe:
        print ("calling ThreadPoolExecutor")
        result = exe.map(parallel_qa,all_contexts)

How can I achieve multi processing to reduce the latency time?
Is there any other recommended approach to reduce latency time?
Any help appreciated. Thanks.

Topic		Replies	Views
How to improve model latency using quantization Beginners	0	317	December 27, 2021
Run parallel api inference for QA 🤗Transformers	0	311	November 19, 2021
[Help] GPU with query answering 🤗Transformers	0	328	November 25, 2020
Does batching in the standard question-answering pipeline provide a speedup? Intermediate	1	1471	December 13, 2021
Unable to Process Concurrent User Request Models	1	358	December 1, 2020

Execute finetuned QA model in parallel

Related topics