Problem: the inference time on CPU for the same request can vary by up to a factor of 2.
Details: I am running an NER pipeline on CPU. I tried a proprietary model, its ONNX version, and a public one (dslim/bert-base-NER), and I tested both locally and on a Google Cloud VM. In all cases, the inference time for the same request varies by up to a factor of 2.
Code for measuring time:
import time

import matplotlib.pyplot as plt
import numpy as np
from transformers import pipeline

model_name = 'dslim/bert-base-NER'  # or the proprietary model
pipe = pipeline('ner', model=model_name, tokenizer=model_name, device='cpu')

num_repeat = 1000
times = []
for _ in range(num_repeat):
    start = time.time()
    pipe(example_memo)  # example_memo: the input text, identical for every request
    times.append(time.time() - start)

print(np.min(times), np.mean(times), np.median(times), np.max(times), np.var(times))
plt.plot(list(range(num_repeat)), times, 'o')
plt.show()
An example plot (x-axis: the index number of the request, y-axis: time in seconds to process that request):
The issue is not present when using GPU: there I see the expected pattern of a slower warm-up followed by stable performance (I can't add a second picture due to forum limitations).
Transformers version: 4.45.1
Python version: 3.11.1
OS: tried both Linux and Windows
Questions:
Is this a correct way to measure inference time in this case, or are there better approaches?
(if yes) Is this expected behaviour?
What are the ways to mitigate this? I've tried setting torch.set_num_threads(1), but that doesn't help (see the sketch below for how I applied it).
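For reference, this is roughly how I applied the thread pinning, together with a slightly more careful timing loop (time.perf_counter and a few warm-up calls before measuring):

import time

import torch
from transformers import pipeline

torch.set_num_threads(1)  # pin PyTorch intra-op parallelism to a single thread

model_name = 'dslim/bert-base-NER'
pipe = pipeline('ner', model=model_name, tokenizer=model_name, device='cpu')

# warm-up runs, excluded from the measurements
for _ in range(10):
    pipe(example_memo)

times = []
for _ in range(1000):
    start = time.perf_counter()  # monotonic, higher-resolution clock
    pipe(example_memo)
    times.append(time.perf_counter() - start)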
Try running the model through ONNX Runtime directly if you haven't; it might give more consistent latency than PyTorch on CPU.
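Something along these lines, as a sketch (assuming the optimum package with its onnxruntime extra is installed; swap in your proprietary model name):

from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer, pipeline

model_name = 'dslim/bert-base-NER'  # or your proprietary model

# export the PyTorch checkpoint to ONNX and load it with ONNX Runtime
ort_model = ORTModelForTokenClassification.from_pretrained(model_name, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# the ORT model plugs into the regular transformers pipeline
pipe = pipeline('ner', model=ort_model, tokenizer=tokenizer)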
Yeah, CPU inference timings are always a bit noisy, but the suggestions above should help. If the problem persists, profiling (e.g. with cProfile) can show where the slowdowns happen.
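For example, a minimal profiling sketch around the existing pipe and example_memo objects, printing the functions with the largest cumulative time:

import cProfile
import pstats

# profile a batch of repeated calls
profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    pipe(example_memo)
profiler.disable()

# show the 20 most expensive functions by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(20)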