Hi! I’m doing zero-shot classification using the pipeline. I noticed that when the input texts are short (e.g. 10 words), the speedup from batched inference is very large: batch size 2 is roughly twice as fast as no batching. However, when the inputs are longer (e.g. ~500 words), passing the texts sequentially is more or less as fast as batched inference. Is it because we have already “maxed out” the GPU’s compute?
In other words, running inference on 10-word sentences at batch size 10 is roughly equivalent to running inference on a single 100-word sentence at batch size 1 (no batching). I’ve seen something similar here. Just want to ask if this is correct and whether anyone else has had a similar experience?
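For reference, this is roughly how I’m timing the comparison (a minimal sketch; the model name, labels, and texts below are placeholders, not my exact setup):

```python
import time
from transformers import pipeline

# Placeholder model and labels for illustration.
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=0,  # run on GPU
)

candidate_labels = ["politics", "sports", "technology"]
short_texts = ["a short sentence of roughly ten words goes right here"] * 100

# Sequential: one call (and one forward pass per label pair) per text.
start = time.perf_counter()
for text in short_texts:
    classifier(text, candidate_labels=candidate_labels)
print("sequential:", time.perf_counter() - start)

# Batched: pass the whole list and let the pipeline batch internally.
start = time.perf_counter()
classifier(short_texts, candidate_labels=candidate_labels, batch_size=2)
print("batch_size=2:", time.perf_counter() - start)
```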