How to ensure fast inference on both CPU and GPU with BertForSequenceClassification?

Hi!

I’d like to perform fast inference using BertForSequenceClassification on both CPUs and GPUs.
For that purpose, I thought that torch DataLoaders could be useful, and on GPU they indeed are.

Given a set of sentences sents I encode them and employ a DataLoader as in

from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

# tokenize all sentences at once, padding every sequence to the longest one in the whole set
encoded_data_val = tokenizer.batch_encode_plus(sents,
                                               add_special_tokens=True,
                                               return_attention_mask=True,
                                               padding='longest',
                                               truncation=True,
                                               max_length=256,
                                               return_tensors='pt')

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']

dataset_val = TensorDataset(input_ids_val, attention_masks_val)
dataloader_val = DataLoader(dataset_val, sampler=SequentialSampler(dataset_val), batch_size=batch_size)

Afterwards, I run inference batch by batch (for some value of batch_size) and retrieve the softmax scores for my binary classification problem:

all_logits = np.empty([0, 2])

for batch in dataloader_val:

    # move the batch to the target device (CPU or GPU)
    batch = tuple(b.to(device) for b in batch)
    inputs = {'input_ids': batch[0],
              'attention_mask': batch[1],
              }
    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs[0]

    # accumulate softmax probabilities for the two classes
    all_logits = np.vstack([all_logits, torch.softmax(logits, dim=1).detach().cpu().numpy()])

This works well and lets me enjoy fast inference on GPU by varying the batch_size.
However, on CPU the code above runs about 2x slower than a simpler version without a DataLoader:

all_logits2 = np.empty([0, 2])

for sent in sents:
    # encode one sentence at a time (batch size 1)
    input_ids = torch.tensor(tokenizer.encode(sent,
                                              add_special_tokens=True,
                                              return_attention_mask=False,
                                              padding='longest',
                                              truncation=True,
                                              max_length=256)).unsqueeze(0).to(device)

    labels = torch.tensor([1]).unsqueeze(0).to(device)  # dummy label, batch size 1
    with torch.no_grad():
        outputs = model(input_ids, labels=labels)
    loss, logits = outputs[:2]
    all_logits2 = np.vstack([all_logits2, torch.softmax(logits, dim=1).detach().cpu().numpy()])

Based on my crude benchmarks, I should stick to the “DataLoader” version above if I want to run fast on GPUs (by playing with the batch size), and to the “DataLoader-free” version when running on CPUs.
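(The “crude benchmark” is nothing fancier than timing the two loops, roughly as below; run_with_dataloader and run_without_dataloader are just hypothetical wrappers around the two snippets above.)

import time

def run_with_dataloader():
    ...  # the "DataLoader" loop above

def run_without_dataloader():
    ...  # the "DataLoader-free" loop above

for fn in (run_with_dataloader, run_without_dataloader):
    start = time.perf_counter()
    fn()
    print(f"{fn.__name__}: {time.perf_counter() - start:.1f} s")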

The behavior reproduces in this Colab notebook (run all cells on a CPU runtime first, then compare with a GPU runtime): https://colab.research.google.com/gist/davidefiocco/4d738ef9d3b1976187086ea31ca25ed2/batch-bert.ipynb

Am I missing something obvious? Can I tweak my snippet using DataLoaders so that it doesn’t result in a speed penalty when running on CPUs?

Thanks!

I think ONNX Runtime can run about 2x faster on CPU. You can check my repo: https://github.com/BinhMinhs10/transformers_onnx or the Microsoft repo: https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/notebooks/PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb. And I note that the Hugging Face notebook for ONNX inference still has a bug :))
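For reference, a minimal sketch of one way to export the classifier and run it with ONNX Runtime on CPU (assuming model and tokenizer are the objects from the snippets above, that the model's first output is the logits, and with an arbitrary file name and opset version):

import numpy as np
import torch
import onnxruntime as ort

# export to ONNX: dummy inputs define the graph, dynamic_axes keep batch size and sequence length flexible
dummy = tokenizer.encode_plus("a dummy sentence", return_tensors="pt")
torch.onnx.export(model,
                  (dummy["input_ids"], dummy["attention_mask"]),
                  "bert_classifier.onnx",
                  input_names=["input_ids", "attention_mask"],
                  output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                                "attention_mask": {0: "batch", 1: "seq"},
                                "logits": {0: "batch"}},
                  opset_version=11)

# run CPU inference with ONNX Runtime
session = ort.InferenceSession("bert_classifier.onnx", providers=["CPUExecutionProvider"])
enc = tokenizer.batch_encode_plus(sents,
                                  add_special_tokens=True,
                                  return_attention_mask=True,
                                  padding='longest',
                                  truncation=True,
                                  max_length=256,
                                  return_tensors='np')
ort_logits = session.run(["logits"],
                         {"input_ids": enc["input_ids"].astype(np.int64),
                          "attention_mask": enc["attention_mask"].astype(np.int64)})[0]

# softmax over the two classes, matching all_logits above
exp = np.exp(ort_logits - ort_logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)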

Thanks! I was indeed thinking of ONNX as a way to package the model and make inference fast!
Still, I am not sure whether the DataLoader code above can be tweaked so that it also runs well on CPU without resorting to ONNX (is it the batching/padding that makes the computation slower?).
I wonder if there are things I can improve in my code above before trying out ONNX.
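For instance, one tweak I'm considering (just a sketch, I haven't benchmarked it) is to tokenize inside a collate_fn, so that each batch is only padded to its own longest sentence rather than to the longest sentence in the entire set, which should waste less compute on padding tokens when running on CPU:

from torch.utils.data import DataLoader

def collate_tokenize(batch_sents):
    # pad only to the longest sentence in this batch, not in the whole dataset
    return tokenizer.batch_encode_plus(batch_sents,
                                       add_special_tokens=True,
                                       return_attention_mask=True,
                                       padding='longest',
                                       truncation=True,
                                       max_length=256,
                                       return_tensors='pt')

# a plain list of strings works as a map-style dataset
dataloader_val = DataLoader(sents, batch_size=batch_size, collate_fn=collate_tokenize)

all_logits = np.empty([0, 2])
for batch in dataloader_val:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
    logits = outputs[0]
    all_logits = np.vstack([all_logits, torch.softmax(logits, dim=1).cpu().numpy()])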