Now it's time for app development, but inference is quite slow, even on a single sentence. I am processing one sentence at a time with the simple function predict_single_sentence(['this is my input sentence'], mymodel, tokenizer):
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('C:\\Users\\mytrainedmodel')
mymodel = TFAutoModelForSequenceClassification.from_pretrained('C:\\Users\\mytrainedmodel')

def predict_single_sentence(text_list, model, tokenizer):
    # tokenize the text
    encodings = tokenizer(text_list,
                          max_length=280,
                          truncation=True,
                          padding=True)
    # wrap the encodings in a tf.data.Dataset
    dataset = tf.data.Dataset.from_tensor_slices(dict(encodings))
    # predict
    preds = model.predict(dataset.batch(1)).logits
    # convert the logits to an array of probabilities
    res = tf.nn.softmax(preds, axis=1).numpy()
    return res
Despite my big RTX, inference is quite slow (about 5 sec for a single sentence). Am I missing something here? Should I create my own pipeline to speed things up?
@sgugger sorry to pull you in, but I would really appreciate your input here. Am I doing things wrong? Is there anything huggingface provides that allows me to speed up inference? Thank you so much!!
=> This is not necessary: normally, you can provide the encodings directly to the model. Make sure to specify the additional argument return_tensors="tf" to get TensorFlow tensors:
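For reference, a minimal sketch of that change, reusing the tokenizer and mymodel loaded above (untested here, since it needs the local fine-tuned model):

```python
# Skip the tf.data.Dataset round-trip: pass the encodings straight to the
# model. return_tensors="tf" makes the tokenizer emit TensorFlow tensors.
encodings = tokenizer(["this is my input sentence"],
                      max_length=280,
                      truncation=True,
                      padding=True,
                      return_tensors="tf")
preds = mymodel(encodings).logits
probs = tf.nn.softmax(preds, axis=1).numpy()
```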
@BramVanroy interesting indeed. It seems from the blog post that inference time on small batches (1 or 4) is cut by more than half when using GPU + ONNX. I am not familiar with the technique at all… I need to try it. Have you already?
Hi @olaffson, I would like to add that batching at inference time is often detrimental on real loads (at least in PyTorch; I have less experience with TF). The reason is token padding.
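A toy illustration of the padding cost, with hypothetical token counts: batching pads every sequence to the longest one in the batch, so short sequences pay for tokens they don't have:

```python
# Three inputs with (hypothetical) token counts; one is much longer.
lengths = [12, 18, 250]

# Processed one by one: each sequence costs exactly its own length.
unbatched_tokens = sum(lengths)               # 280 tokens total

# Processed as one padded batch: every row is padded to max(lengths).
batched_tokens = len(lengths) * max(lengths)  # 3 * 250 = 750 tokens

waste = batched_tokens - unbatched_tokens     # 470 padding tokens of pure overhead
print(unbatched_tokens, batched_tokens, waste)
```

The effect disappears when inputs in a batch have similar lengths, which is why length-sorted bucketing is a common middle ground.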
The real game changer for GPU inference is to do the processing (tokenization) on a different thread than the inference on GPU, to keep the GPU 100% busy. In PT land, it's done with DataLoader, and I guess that's tf.data.Dataset's purpose too.
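In TF you would typically get this overlap from the tf.data pipeline itself (e.g. map with num_parallel_calls plus prefetch). A minimal plain-Python sketch of the producer/consumer idea, with time.sleep standing in for tokenization and the forward pass (both costs are made up):

```python
import queue
import threading
import time

def tokenize(text):
    # Stand-in for CPU-side tokenization (hypothetical cost).
    time.sleep(0.01)
    return text.split()

def infer(tokens):
    # Stand-in for the GPU forward pass (hypothetical cost).
    time.sleep(0.01)
    return len(tokens)

def producer(texts, q):
    # Tokenize on a separate thread so the "GPU" never waits for the CPU.
    for text in texts:
        q.put(tokenize(text))
    q.put(None)  # sentinel: no more work

texts = ["this is my input sentence"] * 10
q = queue.Queue(maxsize=4)  # small buffer keeps memory bounded
t = threading.Thread(target=producer, args=(texts, q))
t.start()

results = []
while (item := q.get()) is not None:
    results.append(infer(item))
t.join()
print(results)
```

With the two stages overlapped, total wall time approaches the cost of the slower stage instead of the sum of both.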
+1 to what @Narsil said, I was just going to suggest the golden rule of optimizing: measure first! Measure how long each piece of code takes, over a few runs with a few different configurations (input lengths, how many you’re predicting, that kind of thing). Then you’ll know where you can best focus your efforts.
I totally get that it’s annoying to measure. I also often drag my feet before doing this. But I’m always glad I did. Otherwise, you might spend a bunch of effort speeding up one part a tiny bit, when the bottleneck is actually somewhere else!
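A small, self-contained timing harness along those lines (the benchmark helper and fake_tokenize stage are hypothetical names, not from the thread); warm-up runs absorb one-off costs like tracing or caching, and the median is more robust than a single run:

```python
import statistics
import time

def benchmark(fn, *args, warmup=2, repeat=5):
    """Time fn(*args) over several runs; return (median_seconds, all_times)."""
    for _ in range(warmup):  # warm-up runs absorb one-off startup costs
        fn(*args)
    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return statistics.median(times), times

# Example: time one pipeline stage in isolation to find the bottleneck.
def fake_tokenize(text):
    return text.split()

median_s, runs = benchmark(fake_tokenize, "this is my input sentence")
print(f"tokenize: {median_s * 1e6:.1f} us (median of {len(runs)} runs)")
```

Timing the tokenizer, the Dataset construction, and model.predict separately with something like this would show immediately which of the three dominates the 5 seconds.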