How to make single-input inference faster? Create my own pipeline?

Hello there!

I was able to fine-tune my own model for text classification (starting from the distilbert-base-uncased-finetuned-sst-2-english model on the Hugging Face Hub). Everything works correctly on my PC.

Now comes app development time, but inference, even on a single sentence, is quite slow. I am processing one sentence at a time with a simple function, called like predict_single_sentence(['this is my input sentence']):


import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('C:\\Users\\mytrainedmodel')
mymodel = TFAutoModelForSequenceClassification.from_pretrained('C:\\Users\\mytrainedmodel')


def predict_single_sentence(text_list, model, tokenizer):
    # tokenize the text
    encodings = tokenizer(text_list,
                          max_length=280,
                          truncation=True,
                          padding=True)
    # transform to a tf.data.Dataset
    dataset = tf.data.Dataset.from_tensor_slices(dict(encodings))
    # predict
    preds = model.predict(dataset.batch(1)).logits

    # transform to an array with probabilities
    res = tf.nn.softmax(preds, axis=1).numpy()

    return res

Despite my big RTX, inference is quite slow (about 5 sec for a single sentence). Am I missing something here? Should I create my own pipeline to speed things up?

Thanks!


@sgugger sorry to pull you in, but I would really appreciate your input here. Am I doing things wrong? Is there anything Hugging Face provides that would allow me to speed up inference? Thank you so much!! :pray:

The conversion to a tf.data.Dataset is not necessary. Normally, you can provide the encodings directly to the model. Make sure to specify the additional argument return_tensors="tf" to get TensorFlow tensors:

def predict_single_sentence(text_list, model, tokenizer):
    # encode
    encoding = tokenizer(text_list,
                         max_length=280,
                         truncation=True,
                         padding=True,
                         return_tensors="tf")
    # forward pass
    outputs = model(encoding)
    logits = outputs.logits

    # transform to an array with probabilities
    probs = tf.nn.softmax(logits, axis=1).numpy()

    return probs
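
For example, a call with the tokenizer and mymodel loaded in the first post might look like this:

probs = predict_single_sentence(['this is my input sentence'], mymodel, tokenizer)
print(probs)  # array of shape (batch_size, num_labels) with class probabilities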

Thank you so much, @nielsr! This is super useful. I will try it right away.

This blog post about inference speed-up techniques might be useful as well: Faster and smaller quantized NLP with Hugging Face and ONNX Runtime | by Yufeng Li | Microsoft Azure | Medium


@BramVanroy interesting indeed. It seems from the blog post that inference time on small batches (1 or 4) is cut by more than half when using GPU + ONNX. I am not at all familiar with the technique… I need to try it. Have you already?
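
(For anyone curious, here is a minimal sketch of what running such an exported model with ONNX Runtime might look like. It assumes the fine-tuned model has already been converted to a model.onnx file, e.g. with the transformers ONNX export tooling or tf2onnx, and that onnxruntime-gpu is installed; the file name is just illustrative.)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('C:\\Users\\mytrainedmodel')

# use the GPU if the CUDA execution provider is available, otherwise fall back to CPU
session = ort.InferenceSession('model.onnx',
                               providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

def predict_single_sentence_onnx(text_list):
    # ONNX Runtime expects plain numpy arrays keyed by the model's input names
    encodings = tokenizer(text_list,
                          max_length=280,
                          truncation=True,
                          padding=True,
                          return_tensors='np')
    logits = session.run(None, dict(encodings))[0]
    # softmax over the class dimension
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)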

Hi @olaffson, I would like to add that batching at inference time is often detrimental (on real workloads, at least in PyTorch; I have less experience with TF). The reason is the padding of tokens.
The real game changer for GPU inference is to do the preprocessing (tokenization) on a different thread than the inference on the GPU, so that the GPU stays 100% busy. In PyTorch land this is done with DataLoader, and I guess that is what tf.data.Dataset is for too.
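
To make that concrete, here is a rough sketch of the idea with tf.data, reusing the tokenizer and mymodel from the first post: the tokenization runs inside the input pipeline and prefetch keeps it one step ahead of the GPU. This is only meant to illustrate the overlap, not to be a tuned implementation.

import tensorflow as tf

def encode_generator(sentences):
    # runs on the CPU; with prefetch() below it stays ahead of the GPU
    for text in sentences:
        enc = tokenizer(text, max_length=280, truncation=True, return_tensors='tf')
        yield {k: tf.squeeze(v, axis=0) for k, v in enc.items()}

sentences = ['first sentence', 'second sentence']  # in practice, a long stream of inputs

dataset = tf.data.Dataset.from_generator(
    lambda: encode_generator(sentences),
    output_signature={
        'input_ids': tf.TensorSpec(shape=(None,), dtype=tf.int32),
        'attention_mask': tf.TensorSpec(shape=(None,), dtype=tf.int32),
    },
)

# prefetch prepares the next example while the GPU processes the current one
preds = mymodel.predict(dataset.batch(1).prefetch(tf.data.AUTOTUNE)).logits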


Thanks @NarsilTest, but I am not sure I understand what you mean here. Could you please explain a bit more? Thanks!

Sorry for using my alt.

What I mean is that you need to check that your GPU is actually being used at 100% (nvidia-smi -l 1).

Could you also instrument your function by printing the time taken at each step? The source of the slowdown might then become clearer.
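
For example, something quick and dirty like this, just timing each step of the original function, would already tell you a lot:

import time
import tensorflow as tf

def predict_single_sentence_timed(text_list, model, tokenizer):
    t0 = time.perf_counter()
    encoding = tokenizer(text_list, max_length=280, truncation=True,
                         padding=True, return_tensors='tf')
    t1 = time.perf_counter()
    logits = model(encoding).logits
    t2 = time.perf_counter()
    probs = tf.nn.softmax(logits, axis=1).numpy()
    t3 = time.perf_counter()
    print(f'tokenize: {t1 - t0:.3f}s  forward: {t2 - t1:.3f}s  softmax: {t3 - t2:.3f}s')
    return probs

Keep in mind that the very first call is typically much slower than the following ones (graph tracing, CUDA initialization), so time a few calls in a row.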

+1 to what @Narsil said. I was just going to suggest the golden rule of optimizing: measure first! Measure how long each piece of code takes, over a few runs and with a few different configurations (input lengths, how many sentences you're predicting, that kind of thing). Then you'll know where you can best focus your efforts.

I totally get that it’s annoying to measure. I also often drag my feet before doing this. But I’m always glad I did. Otherwise, you might spend a bunch of effort speeding up one part a tiny bit, when the bottleneck is actually somewhere else!