High variability of CPU inference times

Problem: the inference time on CPU for the same request can differ by up to a factor of 2 between runs.

Details: I am running an NER pipeline on CPU. I tried a proprietary model, its ONNX version, and a public one (dslim/bert-base-NER). I’ve tested locally and on a Google VM. In all cases, the inference time for the same request varies by up to a factor of 2.

Code for measuring time:

import time

import matplotlib.pyplot as plt
import numpy as np
from transformers import pipeline

model_name = 'dslim/bert-base-NER'  # or the proprietary model
pipe = pipeline('ner', model=model_name, tokenizer=model_name, device='cpu')

# example_memo is the sample request text (defined elsewhere in the script)
num_repeat = 1000
times = []
for _ in range(num_repeat):
    start = time.time()
    pipe(example_memo)
    times.append(time.time() - start)

print(np.min(times), np.mean(times), np.median(times), np.max(times), np.var(times))
plt.plot(list(range(num_repeat)), times, 'o')
plt.show()

An example plot (x-axis: the index number of the request, y-axis: time in seconds to process that request):

  1. No other CPU-intensive processes are running.
  2. The issue is not present when using GPU: there is, as expected, a slower warm-up followed by stable performance (I can’t add a second picture due to forum limitations).

Transformers version: 4.45.1
Python version: 3.11.1
OS: tried both Linux and Windows

Questions:

  1. Is this a correct way to measure inference time in this case, or are there better approaches?
  2. (if yes) Is this expected behaviour?
  3. What are the ways to mitigate this? I’ve tried setting torch.set_num_threads(1) but that doesn’t help.

Alright, so here’s the deal—your timing method is mostly fine, but there are a few things that could make it more reliable.

1. Measuring inference time the right way

  • Swap time.time() for time.perf_counter()—it’s more precise.
  • Run a few warm-up inferences first—models take a moment to optimize.
  • Pin the process to a single CPU core to avoid OS scheduling randomness.
  • Use torch.inference_mode() if it’s a PyTorch model; it avoids autograd bookkeeping (a combined sketch follows this list).
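
A minimal sketch that combines these points, assuming example_memo holds the request text and dslim/bert-base-NER as in the original script (the sample sentence below is just a placeholder):

import time

import numpy as np
import torch
from transformers import pipeline

model_name = 'dslim/bert-base-NER'
pipe = pipeline('ner', model=model_name, tokenizer=model_name, device='cpu')
example_memo = 'My name is Wolfgang and I live in Berlin.'  # placeholder request text

# Warm-up: let model loading, caches, and thread pools settle before timing
for _ in range(10):
    pipe(example_memo)

times = []
with torch.inference_mode():  # no autograd bookkeeping during timing
    for _ in range(1000):
        start = time.perf_counter()
        pipe(example_memo)
        times.append(time.perf_counter() - start)

print(np.min(times), np.median(times), np.mean(times), np.max(times))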

2. Is it normal for inference time to vary this much?

Some variation is expected, but a 2x difference is pretty high. Possible reasons:

  • The OS shifts your process between CPU cores.
  • Your CPU changes speed dynamically (power-saving modes).
  • Caching effects—some runs might hit the CPU cache better than others.
  • PyTorch and ONNX optimize things differently, which could impact consistency.

On GPU, things are more stable since once it’s warmed up, performance evens out.

3. How to make it more stable

Here are some quick fixes:
:white_check_mark: Lock the process to one CPU core:

  • Linux: taskset -c 0 python script.py
  • Windows: Set CPU affinity in Task Manager (or set it programmatically; see the sketch below)
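
If you’d rather pin the core from inside the script, here is a small sketch using the third-party psutil package (an assumption on my part; on Linux, os.sched_setaffinity works without installing anything):

import os

import psutil  # pip install psutil; works on Linux and Windows

proc = psutil.Process(os.getpid())
proc.cpu_affinity([0])  # restrict this process to core 0
print('Pinned to cores:', proc.cpu_affinity())
# Linux-only alternative from the standard library:
# os.sched_setaffinity(0, {0})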

:white_check_mark: Disable CPU frequency scaling (Linux):

sudo cpufreq-set -g performance

:white_check_mark: Warm up the model before timing:

# A few untimed runs so model loading, caches, and thread pools settle
for _ in range(10):
    pipe(example_memo)

:white_check_mark: Use more precise timing:

import time

start = time.perf_counter()  # monotonic, high-resolution timer
pipe(example_memo)
elapsed = time.perf_counter() - start

:white_check_mark: Try ONNX Runtime if you haven’t—it might be more consistent than PyTorch.
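
As a rough sketch of that route, Optimum can export the checkpoint to ONNX and hand it to the same pipeline API (assuming the optimum[onnxruntime] package is installed; example_memo as above):

from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer, pipeline

model_name = 'dslim/bert-base-NER'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# export=True converts the PyTorch checkpoint to ONNX on the fly
onnx_model = ORTModelForTokenClassification.from_pretrained(model_name, export=True)

onnx_pipe = pipeline('ner', model=onnx_model, tokenizer=tokenizer)
print(onnx_pipe(example_memo))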

Yeah, CPU inference is always a bit unpredictable, but these tweaks should help. If the problem persists, profiling (cProfile) can show where the slowdowns happen.
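
For example, a quick profile of a single request (this assumes pipe and example_memo are module-level names, as in the script above):

import cProfile

# Sort by cumulative time to see which pipeline stages dominate
cProfile.run('pipe(example_memo)', sort='cumtime')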


Thank you so much! I’ve applied all the changes you suggested, but unfortunately the issue is still present.

(I turned off power saving modes).

BUT! I just tried using KeyDataset, and even with just 1 text at a time, it removes the issue :slight_smile:
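
Roughly, the pattern looks like this (a sketch, not my exact code; the sample text is a placeholder and the datasets package is assumed):

from datasets import Dataset
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

model_name = 'dslim/bert-base-NER'
pipe = pipeline('ner', model=model_name, tokenizer=model_name, device='cpu')

# Wrap the request(s) in a Dataset and point KeyDataset at the text column
ds = Dataset.from_dict({'text': ['My name is Wolfgang and I live in Berlin.']})  # placeholder
for output in pipe(KeyDataset(ds, 'text')):
    print(output)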


A picture for comparison, when using KeyDataset:

