How to ensure fast inference on both CPU and GPU with BertForSequenceClassification?

Hi!

I’d like to perform fast inference using BertForSequenceClassification on both CPUs and GPUs.
For that purpose, I thought that torch DataLoaders could be useful, and on GPU they indeed are.

Given a set of sentences sents I encode them and employ a DataLoader as in

from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

# tokenize all sentences at once, padding every sequence to the longest one in the whole set
encoded_data_val = tokenizer.batch_encode_plus(sents,
                                               add_special_tokens=True,
                                               return_attention_mask=True,
                                               padding='longest',
                                               truncation=True,
                                               max_length=256,
                                               return_tensors='pt')

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']

dataset_val = TensorDataset(input_ids_val, attention_masks_val)
dataloader_val = DataLoader(dataset_val, sampler=SequentialSampler(dataset_val), batch_size=batch_size)

Afterwards, I run inference batch by batch (for some value of batch_size) and retrieve the softmax scores for my binary classification problem:

all_logits = np.empty([0, 2])

for batch in dataloader_val:

    # move the batch to the target device (CPU or GPU)
    batch = tuple(b.to(device) for b in batch)
    inputs = {'input_ids': batch[0],
              'attention_mask': batch[1],
              }
    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs[0]

    # accumulate softmax probabilities for the two classes
    all_logits = np.vstack([all_logits, torch.softmax(logits, dim=1).detach().cpu().numpy()])

This works well and lets me enjoy fast inference on GPU by varying the batch_size.
However, on CPU the code above runs about 2x slower than a simpler version without a DataLoader:

all_logits2 = np.empty([0, 2])

for sent in sents:
    # encode one sentence at a time (batch size 1)
    input_ids = torch.tensor(tokenizer.encode(sent,
                                              add_special_tokens=True,
                                              return_attention_mask=False,
                                              padding='longest',
                                              truncation=True,
                                              max_length=256)).unsqueeze(0).to(device)

    labels = torch.tensor([1]).unsqueeze(0).to(device)  # dummy label, batch size 1
    with torch.no_grad():
        outputs = model(input_ids, labels=labels)
    loss, logits = outputs[:2]
    all_logits2 = np.vstack([all_logits2, torch.softmax(logits, dim=1).detach().cpu().numpy()])

Based on my crude benchmarks, I should stick to the “DataLoader” version above if I want to run fast on GPUs (by playing with the batch size), and to the “DataLoader-free” version when running on CPUs.
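(The “crude benchmark” is nothing fancier than timing the two loops, roughly as below; run_with_dataloader and run_without_dataloader are just hypothetical wrappers around the two snippets above.)

import time

def run_with_dataloader():
    ...  # the "DataLoader" loop above

def run_without_dataloader():
    ...  # the "DataLoader-free" loop above

for fn in (run_with_dataloader, run_without_dataloader):
    start = time.perf_counter()
    fn()
    print(f"{fn.__name__}: {time.perf_counter() - start:.1f} s")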

The behavior reproduces in this Colab notebook (run all cells on a CPU runtime first, then compare with a GPU runtime): https://colab.research.google.com/gist/davidefiocco/4d738ef9d3b1976187086ea31ca25ed2/batch-bert.ipynb

Am I missing something obvious? Can I tweak my snippet using DataLoaders so that it doesn’t result in a speed penalty when running on CPUs?

Thanks!

I think ONNX Runtime can run about 2x faster on CPU. You can check my repo: https://github.com/BinhMinhs10/transformers_onnx or the Microsoft repo: https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/notebooks/PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb. And I note that the Hugging Face notebook for ONNX inference still has a bug :))
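For reference, a minimal sketch of one way to export the classifier and run it with ONNX Runtime on CPU (assuming model and tokenizer are the objects from the snippets above, that the model's first output is the logits, and with an arbitrary file name and opset version):

import numpy as np
import torch
import onnxruntime as ort

# export to ONNX: dummy inputs define the graph, dynamic_axes keep batch size and sequence length flexible
dummy = tokenizer.encode_plus("a dummy sentence", return_tensors="pt")
torch.onnx.export(model,
                  (dummy["input_ids"], dummy["attention_mask"]),
                  "bert_classifier.onnx",
                  input_names=["input_ids", "attention_mask"],
                  output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                                "attention_mask": {0: "batch", 1: "seq"},
                                "logits": {0: "batch"}},
                  opset_version=11)

# run CPU inference with ONNX Runtime
session = ort.InferenceSession("bert_classifier.onnx", providers=["CPUExecutionProvider"])
enc = tokenizer.batch_encode_plus(sents,
                                  add_special_tokens=True,
                                  return_attention_mask=True,
                                  padding='longest',
                                  truncation=True,
                                  max_length=256,
                                  return_tensors='np')
ort_logits = session.run(["logits"],
                         {"input_ids": enc["input_ids"].astype(np.int64),
                          "attention_mask": enc["attention_mask"].astype(np.int64)})[0]

# softmax over the two classes, matching all_logits above
exp = np.exp(ort_logits - ort_logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)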

Thanks! I was indeed thinking of ONNX as a way to package the model and make inference fast!
Still, I am not sure whether the DataLoader code above can be tweaked so that it also runs well on CPU without resorting to ONNX (is it the batching/padding that makes the computation slower?).
I wonder if there are things I can improve in my code above before trying out ONNX.
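For instance, one tweak I'm considering (just a sketch, I haven't benchmarked it) is to tokenize inside a collate_fn, so that each batch is only padded to its own longest sentence rather than to the longest sentence in the entire set, which should waste less compute on padding tokens when running on CPU:

from torch.utils.data import DataLoader

def collate_tokenize(batch_sents):
    # pad only to the longest sentence in this batch, not in the whole dataset
    return tokenizer.batch_encode_plus(batch_sents,
                                       add_special_tokens=True,
                                       return_attention_mask=True,
                                       padding='longest',
                                       truncation=True,
                                       max_length=256,
                                       return_tensors='pt')

# a plain list of strings works as a map-style dataset
dataloader_val = DataLoader(sents, batch_size=batch_size, collate_fn=collate_tokenize)

all_logits = np.empty([0, 2])
for batch in dataloader_val:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
    logits = outputs[0]
    all_logits = np.vstack([all_logits, torch.softmax(logits, dim=1).cpu().numpy()])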