Does batching in the standard question-answering pipeline provide a speedup?

Hi there.

I am using the QA pipeline and hoping to get a speedup through batching. Basically, this is what I want (pseudocode, not exact code; I would appreciate your guidance on doing it right):

    _classifier = pipeline("question-answering", model="deberta")
    result = _classifier(
        question=["What is the goal?", "When did it happen?", "Who did it?"],
        context="Once upon a time many years ago an engineer set up a question answering machine and it was hoped it would run really fast",
    )

So the basic goal is to get this to work at roughly the same speed as a single question, assuming I give it a decent GPU capable of parallelizing the 3 questions.
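Concretely, here is roughly how I am measuring it (again just a sketch; "deepset/roberta-base-squad2" is only a placeholder for whatever checkpoint I end up using, and device=0 assumes a CUDA GPU):

    import time
    from transformers import pipeline

    # Placeholder checkpoint; device=0 assumes a CUDA GPU is available
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2", device=0)

    context = ("Once upon a time many years ago an engineer set up a question answering "
               "machine and it was hoped it would run really fast")
    questions = ["What is the goal?", "When did it happen?", "Who did it?"]

    qa(question=questions[0], context=context)  # warm-up so first-call overhead doesn't skew timing

    t0 = time.time()
    qa(question=questions[0], context=context)
    print("one question:   ", time.time() - t0)

    t0 = time.time()
    qa(question=questions, context=context)     # the call I hoped would be batched on the GPU
    print("three questions:", time.time() - t0)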

I did search before asking; I found a few others who have asked this in different ways, but it has not been answered yet:

Batched pipeline · Issue #6327 · huggingface/transformers · GitHub

[Benchmark] Pipeline for question answering · Issue #3007 · huggingface/transformers · GitHub

Many thanks for any pointers or help!

In my experience it doesn’t help.
Also, when you look at the code, you can see that once it encodes all the given examples, it iterates over them one by one (transformers.pipelines — transformers 4.0.0 documentation):

all_answers = []
for features, example in zip(features_list, examples):
    model_input_names = self.tokenizer.model_input_names + ["input_ids"]
    fw_args = {k: [feature.__dict__[k] for feature in features] for k in model_input_names}

    # Manage tensor allocation on correct device
    with self.device_placement():
        if self.framework == "tf":
            fw_args = {k: tf.constant(v) for (k, v) in fw_args.items()}
            start, end = self.model(fw_args)[:2]
            start, end = start.numpy(), end.numpy()
        else:
            with torch.no_grad():
                # Retrieve the score for the context tokens only (removing question tokens)
                fw_args = {k: torch.tensor(v, device=self.device) for (k, v) in fw_args.items()}
                start, end = self.model(**fw_args)[:2]
                start, end = start.cpu().numpy(), end.cpu().numpy()
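If you really want the forward pass batched, you more or less have to drop below the pipeline and batch it yourself with the tokenizer and model. Something like this rough sketch (not tested on your setup; "deepset/roberta-base-squad2" is just an example checkpoint, it assumes a CUDA GPU, and it skips most of the answer-decoding work the pipeline does after the forward pass):

    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    model_name = "deepset/roberta-base-squad2"  # example checkpoint, swap in whatever you use
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name).to("cuda").eval()

    context = ("Once upon a time many years ago an engineer set up a question answering "
               "machine and it was hoped it would run really fast")
    questions = ["What is the goal?", "When did it happen?", "Who did it?"]

    # Encode all question/context pairs into a single padded batch
    inputs = tokenizer(questions, [context] * len(questions),
                       padding=True, truncation=True, return_tensors="pt").to("cuda")

    with torch.no_grad():
        outputs = model(**inputs)  # one forward pass for all three questions

    # Very naive decoding: argmax start/end per question (the pipeline does much more here,
    # e.g. masking question tokens and searching over valid spans)
    for i, question in enumerate(questions):
        start = int(outputs.start_logits[i].argmax())
        end = int(outputs.end_logits[i].argmax())
        answer = tokenizer.decode(inputs["input_ids"][i][start:end + 1])
        print(question, "->", answer)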