I have a pre-trained model (TFDistilBertForSequenceClassification). I want to run the model's predictions on a single CPU core to eliminate contention for CPU resources among the running workers (a gunicorn server with multiple instances). TensorFlow has built-in configuration for limiting the resources it uses (tf.config.threading.set_inter_op_parallelism_threads, tf.config.threading.set_intra_op_parallelism_threads), but I guess it doesn't affect the computations because they run under the transformers wrapper. Is there a way to achieve this in transformers? Alternatively, are there any options to increase throughput? I have a Flask server that runs classifications and exposes an HTTP API. Currently, one prediction on my test example (585 characters) takes ~200 ms (it depends on the input size). Since I can't optimize the model itself (a different topic), I want to scale the computations.
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained("<path>")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)

# TF models expect TensorFlow tensors from the tokenizer
encoded = tokenizer("text to classify", return_tensors="tf", truncation=True)
outputs = model(**encoded)
# Convert raw logits to class probabilities
probabilities = tf.nn.softmax(outputs.logits, axis=-1)
predictions = probabilities.numpy()[0]