How to make single-input inference faster? Create my own pipeline?

Hello there!

I was able to fine-tune my own model for text classification (starting from the distilbert-base-uncased-finetuned-sst-2-english model on the Hugging Face Hub). Everything works correctly on my PC.

Now comes app development time, but inference, even on a single sentence, is quite slow. I am processing one sentence at a time with a simple function, called like predict_single_sentence(['this is my input sentence']):


import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('C:\\Users\\mytrainedmodel')
mymodel = TFAutoModelForSequenceClassification.from_pretrained('C:\\Users\\mytrainedmodel')


def predict_single_sentence(text_list, model, tokenizer):
    # tokenize the text
    encodings = tokenizer(text_list,
                          max_length=280,
                          truncation=True,
                          padding=True)
    # transform to a tf.data.Dataset
    dataset = tf.data.Dataset.from_tensor_slices(dict(encodings))
    # predict
    preds = model.predict(dataset.batch(1)).logits

    # transform to an array with probabilities
    res = tf.nn.softmax(preds, axis=1).numpy()

    return res

Despite my big RTX, inference is quite slow (about 5 sec for a single sentence). Am I missing something here? Should I create my own pipeline to speed things up?

Thanks!


@sgugger sorry to pull you in, but I would really appreciate your input here. Am I doing things wrong? Is there anything Hugging Face provides that would allow me to speed up inference? Thank you so much!! :pray:

The conversion to a tf.data.Dataset is not necessary. Normally, you can provide the encodings directly to the model. Make sure to specify the additional argument return_tensors="tf" to get TensorFlow tensors:

def predict_single_sentence(text_list, model, tokenizer):
    # encode
    encoding = tokenizer(text_list,
                         max_length=280,
                         truncation=True,
                         padding=True,
                         return_tensors="tf")
    # forward pass
    outputs = model(encoding)
    logits = outputs.logits

    # transform to an array with probabilities
    probs = tf.nn.softmax(logits, axis=1).numpy()

    return probs
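
For example, a call with the tokenizer and mymodel loaded in the first post might look like this:

probs = predict_single_sentence(['this is my input sentence'], mymodel, tokenizer)
print(probs)  # array of shape (batch_size, num_labels) with class probabilities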

Thank you so much, @nielsr! This is super useful. I will try it right away.

This blog post about inference speed-up techniques might be useful as well: Faster and smaller quantized NLP with Hugging Face and ONNX Runtime | by Yufeng Li | Microsoft Azure | Medium


@BramVanroy interesting indeed. It seems from the blog post that inference time on small batches (1 or 4) is cut by more than half when using GPU + ONNX. I am not at all familiar with the technique… I need to try it. Have you already?
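
(For anyone curious, here is a minimal sketch of what running such an exported model with ONNX Runtime might look like. It assumes the fine-tuned model has already been converted to a model.onnx file, e.g. with the transformers ONNX export tooling or tf2onnx, and that onnxruntime-gpu is installed; the file name is just illustrative.)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('C:\\Users\\mytrainedmodel')

# use the GPU if the CUDA execution provider is available, otherwise fall back to CPU
session = ort.InferenceSession('model.onnx',
                               providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

def predict_single_sentence_onnx(text_list):
    # ONNX Runtime expects plain numpy arrays keyed by the model's input names
    encodings = tokenizer(text_list,
                          max_length=280,
                          truncation=True,
                          padding=True,
                          return_tensors='np')
    logits = session.run(None, dict(encodings))[0]
    # softmax over the class dimension
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)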

Hi @olaffson, I would like to add that batching at inference time is often detrimental (on real workloads, at least in PyTorch; I have less experience with TF). The reason is the padding of tokens.
The real game changer for GPU inference is to do the preprocessing (tokenization) on a different thread than the inference on the GPU, so that the GPU stays 100% busy. In PyTorch land this is done with DataLoader, and I guess that is what tf.data.Dataset is for too.
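
To make that concrete, here is a rough sketch of the idea with tf.data, reusing the tokenizer and mymodel from the first post: the tokenization runs inside the input pipeline and prefetch keeps it one step ahead of the GPU. This is only meant to illustrate the overlap, not to be a tuned implementation.

import tensorflow as tf

def encode_generator(sentences):
    # runs on the CPU; with prefetch() below it stays ahead of the GPU
    for text in sentences:
        enc = tokenizer(text, max_length=280, truncation=True, return_tensors='tf')
        yield {k: tf.squeeze(v, axis=0) for k, v in enc.items()}

sentences = ['first sentence', 'second sentence']  # in practice, a long stream of inputs

dataset = tf.data.Dataset.from_generator(
    lambda: encode_generator(sentences),
    output_signature={
        'input_ids': tf.TensorSpec(shape=(None,), dtype=tf.int32),
        'attention_mask': tf.TensorSpec(shape=(None,), dtype=tf.int32),
    },
)

# prefetch prepares the next example while the GPU processes the current one
preds = mymodel.predict(dataset.batch(1).prefetch(tf.data.AUTOTUNE)).logits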


Thanks @NarsilTest, but I am not sure I understand what you mean here. Could you please explain a bit more? Thanks!

Sorry for using my alt.

What I mean is that you need to check that your GPU is actually being used at 100% (nvidia-smi -l 1).

Could you also instrument your function by printing the time taken at each step? The source of the slowdown might then become clearer.
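
For example, something quick and dirty like this, just timing each step of the original function, would already tell you a lot:

import time
import tensorflow as tf

def predict_single_sentence_timed(text_list, model, tokenizer):
    t0 = time.perf_counter()
    encoding = tokenizer(text_list, max_length=280, truncation=True,
                         padding=True, return_tensors='tf')
    t1 = time.perf_counter()
    logits = model(encoding).logits
    t2 = time.perf_counter()
    probs = tf.nn.softmax(logits, axis=1).numpy()
    t3 = time.perf_counter()
    print(f'tokenize: {t1 - t0:.3f}s  forward: {t2 - t1:.3f}s  softmax: {t3 - t2:.3f}s')
    return probs

Keep in mind that the very first call is typically much slower than the following ones (graph tracing, CUDA initialization), so time a few calls in a row.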

+1 to what @Narsil said. I was just going to suggest the golden rule of optimizing: measure first! Measure how long each piece of code takes, over a few runs and with a few different configurations (input lengths, how many sentences you're predicting, that kind of thing). Then you'll know where you can best focus your efforts.

I totally get that it’s annoying to measure. I also often drag my feet before doing this. But I’m always glad I did. Otherwise, you might spend a bunch of effort speeding up one part a tiny bit, when the bottleneck is actually somewhere else!