Hi! I’m doing zero-shot classification using the pipeline. I noticed that when the input texts are short (e.g. 10 words), the speedup from batched inference is very large: batch size 2 is roughly twice as fast as no batching. However, when the inputs are longer (e.g. ~500 words), passing the texts sequentially is more or less as fast as batched inference. Is it because we have already “maxed out” the GPU’s compute?
In other words, running inference on 10-word sentences at batch size 10 is roughly equivalent to running inference on a single 100-word sentence at batch size 1 (no batching). I’ve seen something similar here. Just want to ask if this is correct and whether anyone else has had a similar experience?
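For reference, this is roughly how I’m timing the comparison (a minimal sketch; the model name, labels, and texts below are placeholders, not my exact setup):

```python
import time
from transformers import pipeline

# Placeholder model and labels for illustration.
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=0,  # run on GPU
)

candidate_labels = ["politics", "sports", "technology"]
short_texts = ["a short sentence of roughly ten words goes right here"] * 100

# Sequential: one call (and one forward pass per label pair) per text.
start = time.perf_counter()
for text in short_texts:
    classifier(text, candidate_labels=candidate_labels)
print("sequential:", time.perf_counter() - start)

# Batched: pass the whole list and let the pipeline batch internally.
start = time.perf_counter()
classifier(short_texts, candidate_labels=candidate_labels, batch_size=2)
print("batch_size=2:", time.perf_counter() - start)
```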