Problem: the inference time on CPU for the same request can vary by up to a factor of 2.
Details: I am running an NER pipeline on CPU. I tried a proprietary model, its ONNX version, and a public one (dslim/bert-base-NER), and I tested both locally and on a Google Cloud VM. In all cases, the inference time for the same request varies by up to a factor of 2.
Code for measuring time:
import time

import matplotlib.pyplot as plt
import numpy as np
from transformers import pipeline

model_name = 'dslim/bert-base-NER'  # or the proprietary model
pipe = pipeline('ner', model=model_name, tokenizer=model_name, device='cpu')

num_repeat = 1000
times = []
for _ in range(num_repeat):
    start = time.time()
    pipe(example_memo)  # example_memo: the input text, identical for every request
    times.append(time.time() - start)

print(np.min(times), np.mean(times), np.median(times), np.max(times), np.var(times))
plt.plot(list(range(num_repeat)), times, 'o')
plt.show()
An example plot (x-axis: the index number of the request, y-axis: time in seconds to process that request):
The issue is not present when using GPU: there I see the expected pattern of a slower warm-up followed by stable performance (I can't add a second picture due to forum limitations).
Transformers version: 4.45.1
Python version: 3.11.1
OS: tried both Linux and Windows
Questions:
Is this a correct way to measure inference time in this case, or are there better approaches?
(if yes) Is this expected behaviour?
What are the ways to mitigate this? I've tried setting torch.set_num_threads(1), but that doesn't help (see the sketch below for how I applied it).
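For reference, this is roughly how I applied the thread pinning, together with a slightly more careful timing loop (time.perf_counter and a few warm-up calls before measuring):

import time

import torch
from transformers import pipeline

torch.set_num_threads(1)  # pin PyTorch intra-op parallelism to a single thread

model_name = 'dslim/bert-base-NER'
pipe = pipeline('ner', model=model_name, tokenizer=model_name, device='cpu')

# warm-up runs, excluded from the measurements
for _ in range(10):
    pipe(example_memo)

times = []
for _ in range(1000):
    start = time.perf_counter()  # monotonic, higher-resolution clock
    pipe(example_memo)
    times.append(time.perf_counter() - start)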
Try running the model through ONNX Runtime directly if you haven't; it might give more consistent latency than PyTorch on CPU.
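Something along these lines, as a sketch (assuming the optimum package with its onnxruntime extra is installed; swap in your proprietary model name):

from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer, pipeline

model_name = 'dslim/bert-base-NER'  # or your proprietary model

# export the PyTorch checkpoint to ONNX and load it with ONNX Runtime
ort_model = ORTModelForTokenClassification.from_pretrained(model_name, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# the ORT model plugs into the regular transformers pipeline
pipe = pipeline('ner', model=ort_model, tokenizer=tokenizer)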
Yeah, CPU inference timings are always a bit noisy, but the suggestions above should help. If the problem persists, profiling (e.g. with cProfile) can show where the slowdowns happen.
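For example, a minimal profiling sketch around the existing pipe and example_memo objects, printing the functions with the largest cumulative time:

import cProfile
import pstats

# profile a batch of repeated calls
profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    pipe(example_memo)
profiler.disable()

# show the 20 most expensive functions by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(20)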