I have a pre-trained model (TFDistilBertForSequenceClassification). I want to run the model's predictions on a single CPU core to eliminate contention for CPU resources among the running workers (a gunicorn server with multiple instances). TensorFlow has a built-in configuration for limiting the resources a computation may use (tf.config.threading.set_intra_op_parallelism_threads), but I suspect it has no effect here because the computation runs under the transformers wrapper. Are there any ways to implement what I want in transformers? Or are there other options to increase throughput?

I have a Flask server that runs the classifications and exposes an HTTP API. Currently, one prediction on my test example (585 characters) takes ~200 ms; the time depends on the input size. Since I can't optimize the model itself (a different topic), I want to scale the computations instead.
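For context, this is roughly how I tried to apply the threading limits (a minimal sketch; as far as I know, these calls must run before TensorFlow initializes its thread pools, i.e. before the model is loaded):

import tensorflow as tf

# Limit TensorFlow's CPU parallelism. These must be called before any op
# runs; once the default thread pools exist, changing them raises an error.
tf.config.threading.set_intra_op_parallelism_threads(1)  # threads within a single op
tf.config.threading.set_inter_op_parallelism_threads(1)  # threads across independent ops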
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained("<path>")
tokenizer = DistilBertTokenizerFast.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", num_labels=2
)

# Tokenize the input and run a single forward pass.
encoded = tokenizer("text to classify", return_tensors="tf", truncation=True)
outputs = model(**encoded)

# Convert the logits to class probabilities.
predicted_label_classes = tf.nn.softmax(outputs.logits, axis=-1)
predictions = predicted_label_classes.numpy()
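The ~200 ms figure comes from timing one full prediction, roughly like this (a hypothetical timing harness, not part of the server code):

import time

start = time.perf_counter()
encoded = tokenizer("text to classify", return_tensors="tf", truncation=True)
outputs = model(**encoded)
predictions = tf.nn.softmax(outputs.logits, axis=-1).numpy()
print(f"one prediction: {(time.perf_counter() - start) * 1000:.1f} ms")  # ~200 ms on my test input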