How to ensure fast inference on both CPU and GPU with BertForSequenceClassification?


I’d like to perform fast inference using BertForSequenceClassification on both CPUs and GPUs.
For this purpose, I thought that torch DataLoaders could be useful, and indeed on GPU they are.

Given a list of sentences sents, I encode them and wrap them in a DataLoader as follows:

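# NOTE: the arguments after `sents` were cut off in the post; the ones below are assumed.
encoded_data_val = tokenizer.batch_encode_plus(sents,
                                               padding=True,         # assumed: pad to a common length
                                               truncation=True,      # assumed
                                               max_length=256,       # assumed: same limit as the loop version below
                                               return_tensors='pt')  # TensorDataset needs tensors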

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']

dataset_val = TensorDataset(input_ids_val, attention_masks_val)
dataloader_val = DataLoader(dataset_val, sampler=SequentialSampler(dataset_val), batch_size=batch_size)

Afterwards, I perform inference on batches (using some value for batch_size) and retrieve softmax scores for my binary problem using

all_logits = np.empty([0,2])

for batch in dataloader_val:

    # Move every tensor in the batch to the target device (CPU or GPU)
    batch = tuple(b.to(device) for b in batch)
    inputs = {'input_ids': batch[0],
              'attention_mask': batch[1]}
    with torch.no_grad():        
        outputs = model(**inputs)

    logits = outputs[0]

    all_logits = np.vstack([all_logits, torch.softmax(logits, dim=1).detach().cpu().numpy()])

This works well and allows me to enjoy fast inference on GPU varying the batch_size.
However, on CPU the code above runs 2x slower than a simpler version without DataLoader:

all_logits2 = np.empty([0,2])

for sent in sents:
    input_ids = torch.tensor(tokenizer.encode(sent,
                                              max_length=256)).unsqueeze(0).to(device)  # Batch size 1

    labels = torch.tensor([1]).unsqueeze(0).to(device)  # Batch size 1
    outputs = model(input_ids, labels=labels)
    loss, logits = outputs[:2]
    all_logits2 = np.vstack([all_logits2, torch.softmax(logits, dim=1).detach().cpu().numpy()])

Based on my crude benchmarks, I should stick with the “DataLoader” version above if I want to run faster on GPUs by tuning the batch size, and with the “DataLoader-free” version if I am running on CPUs.
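For a comparison like this, a small timing helper makes the numbers easier to trust. The following is only a minimal sketch, not the benchmark used in the post: time_inference and dummy_workload are hypothetical names, and the dummy workload stands in for a function that wraps either inference loop.

```python
import statistics
import time


def time_inference(fn, warmup=1, repeats=5):
    """Return the median wall-clock seconds of fn() over `repeats` timed runs."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Hypothetical stand-in workload; replace it with a function that runs
# the DataLoader loop or the sentence-by-sentence loop.
def dummy_workload():
    return sum(i * i for i in range(10_000))


median_seconds = time_inference(dummy_workload)
print(f"median: {median_seconds:.6f} s")
```

Two things are worth keeping in mind when comparing the two loops. The sentence-by-sentence version passes labels and does not use torch.no_grad(), so it also computes a loss, which means the two versions are not doing exactly the same work. On GPU, CUDA kernels run asynchronously, so the timer should only be stopped after the results have been copied back to the CPU or after torch.cuda.synchronize().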

The behavior reproduces in this Colab notebook when all cells are run on a CPU runtime first and then compared on a GPU runtime:

Am I missing something obvious? Can I tweak my snippet using DataLoaders so that it doesn’t result in a speed penalty when running on CPUs?



I think ONNX Runtime can run about 2x faster on CPU. You can check my repo: or the Microsoft repo: Also note that the Hugging Face notebook for inference with ONNX still has a bug :))

Thanks! I was indeed thinking of ONNX as a way to package the model and make inference fast!
Still, I am not sure whether the DataLoader code above can be tweaked to run optimally on CPU as well without ONNX (is it the batching that makes the computation slower?).
I wonder if there are things I can improve in my code above before trying out ONNX.
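One possible cause, which nothing in this thread confirms, is padding: if the batched version pads every sentence to a fixed length such as 256 while the one-at-a-time loop feeds each sentence unpadded, the CPU ends up processing many more token positions. The sketch below only counts token slots for made-up sentence lengths; padded_tokens and the lengths list are hypothetical and no model is run.

```python
def padded_tokens(lengths, batch_size, pad_to=None):
    """Count the token slots processed when `lengths` are batched in order.

    pad_to=None pads each batch to its own longest sequence;
    an integer pads every batch to that fixed length.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = pad_to if pad_to is not None else max(batch)
        total += width * len(batch)
    return total


# Hypothetical token counts for six sentences (illustrative, not measured).
lengths = [12, 9, 40, 15, 11, 200]

fixed = padded_tokens(lengths, batch_size=3, pad_to=256)        # pad everything to 256
dynamic = padded_tokens(lengths, batch_size=3)                  # pad to the longest in each batch
sorted_dynamic = padded_tokens(sorted(lengths), batch_size=3)   # group similar lengths first
unbatched = padded_tokens(lengths, batch_size=1)                # one sentence at a time, no padding

print(fixed, dynamic, sorted_dynamic, unbatched)  # 1536 720 636 287
```

If padding does turn out to be the issue, one option is to tokenize inside the DataLoader's collate_fn with padding=True, so that each batch is padded only to its own longest sentence, and to sort or bucket sentences by length before batching.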

Hi! You might be interested in the OpenVINO optimization for Transformers. It’s currently a work in progress, but we welcome any feedback :)

PR: Intel OpenVINO backend (inference only) by dkurt · Pull Request #14203 · huggingface/transformers · GitHub

basic usage:

from transformers import BertTokenizer, OVAutoModelForSequenceClassification

model_name_or_path = ""
tokenizer = BertTokenizer.from_pretrained(model_name_or_path)
model = OVAutoModelForSequenceClassification.from_pretrained(model_name_or_path, from_pt=True)

Hey there, thanks for this! I’ve been trying to use OpenVINO but I’m having a hard time getting it set up. Is there a Dockerfile that you could point me to? OpenVINO keeps forcing me to “sign up” and get a license, etc. It’s really annoying ;)

I would like to use OpenVINO with transformers. Please let me know if there are any docs and/or benchmarks comparing it to quantization with ONNX.


vgoklani, OpenVINO is now distributed via pip, so it’s enough to install it with python3 -m pip install openvino
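After installing, a quick way to confirm that the package is visible to the current interpreter is to look it up without importing it. This is a minimal sketch using only the standard library; is_installed is a hypothetical helper name, and the only assumption about OpenVINO is that the pip package exposes a top-level module named openvino.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


print("openvino available:", is_installed("openvino"))
```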
