Limit prediction computation to a single CPU core?

I have a pre-trained model (TFDistilBertForSequenceClassification). I want to run the model's predictions on a single CPU core to eliminate contention for CPU resources among the running workers (a Gunicorn server with multiple worker instances). TensorFlow has built-in configuration for limiting the resources it uses (tf.config.threading.set_inter_op_parallelism_threads, tf.config.threading.set_intra_op_parallelism_threads), but I suspect it doesn't affect the computation because the model runs under the transformers wrapper. Is there a way to achieve this with transformers? Or are there any options to increase throughput? I have a Flask server that runs the classifications and exposes them via an HTTP API. Currently it takes ~200 ms to run one prediction on my test example (585 characters); the latency depends on the input size. Since I can't optimize the model itself (a different topic), I want to scale the computations.

import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizerFast

model = TFDistilBertForSequenceClassification.from_pretrained("<path>")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)

encoded = tokenizer("text to classify", return_tensors="tf", truncation=True)
outputs = model(**encoded)
predicted_label_classes = tf.nn.softmax(outputs.logits, axis=-1)  # softmax over the logits gives class probabilities
predictions = predicted_label_classes.numpy()[0]
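
For reference, the tokenizer and model also accept a batch of texts in one call, which is one way to spread the per-request overhead; a rough sketch with placeholder texts:

# Rough sketch: classify a batch of texts in a single call instead of one at a time.
texts = ["first text to classify", "second text to classify"]
encoded = tokenizer(texts, return_tensors="tf", truncation=True, padding=True)
outputs = model(**encoded)
probabilities = tf.nn.softmax(outputs.logits, axis=-1).numpy()  # shape: (batch_size, 2)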

I found that putting the following at the top of my Python code properly limits transformers to a single CPU core:

import os

# Set before TensorFlow is imported so its OpenMP thread pool picks it up.
os.environ["OMP_NUM_THREADS"] = "1"

import tensorflow as tf

# These must be called before TensorFlow runs any ops.
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)
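
If the worker processes still end up competing for the same core, each process can additionally be pinned to its own core; a rough, Linux-only sketch using the standard library (the PID-based core choice is only an illustration):

import os

# Rough sketch (Linux only): pin this worker process to a single core so
# separate gunicorn workers do not contend for the same CPU. The PID-based
# round-robin assignment is just an illustration; choose core IDs to suit
# your machine.
core = os.getpid() % os.cpu_count()
os.sched_setaffinity(0, {core})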

I'm adding this here just because I stumbled upon this thread while looking for a solution for PyTorch.
I was using multiprocessing with the transformers TextClassificationPipeline, and my process was always running at 600-700% CPU utilization, which I wanted to improve.
I found the following tip in Language Processing Pipelines · spaCy Usage Documentation:

import torch
torch.set_num_threads(1)  # limit intra-op parallelism to one thread per process

This reduced the CPU utilization to only about 100% per process and did not noticeably impact my inference speed, which in turn allowed me to process more in parallel again.
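
For anyone wanting a concrete pattern, here is a minimal sketch (the pool size, model name, and texts are placeholders): the thread limit goes in the pool initializer so it runs in every worker process before any inference happens.

from multiprocessing import Pool

import torch
from transformers import pipeline

_clf = None  # one pipeline instance per worker process

def init_worker():
    global _clf
    torch.set_num_threads(1)  # limit each worker to a single intra-op thread
    _clf = pipeline("text-classification",
                    model="distilbert-base-uncased-finetuned-sst-2-english")

def classify(text):
    return _clf(text)

if __name__ == "__main__":
    texts = ["I love this!", "This is terrible."]
    with Pool(processes=4, initializer=init_worker) as pool:
        print(pool.map(classify, texts))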