I have a pre-trained model (TFDistilBertForSequenceClassification). I want to run the model's predictions on a single CPU core to eliminate contention for CPU resources among the running workers (a gunicorn server with multiple instances). TensorFlow has built-in configuration for limiting the resources it uses (tf.config.threading.set_inter_op_parallelism_threads, tf.config.threading.set_intra_op_parallelism_threads), but I guess it doesn't affect the computations because they run under the transformers wrapper. Is there a way to achieve this in transformers? Alternatively, are there any options to increase throughput? I have a Flask server that runs classifications and exposes an HTTP API. Currently, one prediction on my test example (585 characters) takes ~200 ms (it depends on the input size). Since I can't optimize the model itself (a different topic), I want to scale the computations.
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained("<path>")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)

# TF models expect TensorFlow tensors from the tokenizer
encoded = tokenizer("text to classify", return_tensors="tf", truncation=True)
outputs = model(**encoded)
# Convert raw logits to class probabilities
probabilities = tf.nn.softmax(outputs.logits, axis=-1)
predictions = probabilities.numpy()[0]