Hi!
I’d like to perform fast inference using BertForSequenceClassification on both CPUs and GPUs.
For that purpose, I thought that torch DataLoaders could be useful, and on GPU they indeed are.
Given a set of sentences sents, I encode them and employ a DataLoader as in:
import numpy as np
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

encoded_data_val = tokenizer.batch_encode_plus(sents,
                                               add_special_tokens=True,
                                               return_attention_mask=True,
                                               padding='longest',
                                               truncation=True,
                                               max_length=256,
                                               return_tensors='pt')
input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
dataset_val = TensorDataset(input_ids_val, attention_masks_val)
dataloader_val = DataLoader(dataset_val, sampler=SequentialSampler(dataset_val), batch_size=batch_size)
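Note that since batch_encode_plus sees all of sents at once, padding='longest' pads every sentence to the longest one in the whole set (capped at max_length=256), so every batch carries that full width. A quick sanity check of how much padding this introduces (illustrative sketch):

# Sketch: all encoded sentences share one padded length, namely the longest
# sentence in the whole of `sents` (capped at 256).
print(input_ids_val.shape)                  # torch.Size([len(sents), padded_len])
# Fraction of positions that are real tokens rather than padding; a low value
# means the batched version spends much of its compute on pad tokens.
print(attention_masks_val.float().mean())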
Afterwards, I perform inference on batches (using some value for batch_size) and retrieve softmax scores for my binary problem using:
all_logits = np.empty([0, 2])
for batch in dataloader_val:
    batch = tuple(b.to(device) for b in batch)
    inputs = {'input_ids': batch[0],
              'attention_mask': batch[1],
              }
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs[0]
    all_logits = np.vstack([all_logits, torch.softmax(logits, dim=1).detach().cpu().numpy()])
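(As an aside, unrelated to the CPU/GPU gap: calling np.vstack on every iteration copies the whole accumulated array each time, which grows quadratically with the number of batches. Collecting the per-batch outputs in a list and stacking once at the end avoids that; a sketch:)

# Equivalent accumulation without repeated array copies (sketch):
probs_per_batch = []
for batch in dataloader_val:
    batch = tuple(b.to(device) for b in batch)
    with torch.no_grad():
        outputs = model(input_ids=batch[0], attention_mask=batch[1])
    probs_per_batch.append(torch.softmax(outputs[0], dim=1).cpu().numpy())
all_logits = np.vstack(probs_per_batch)  # single concatenation at the end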
This works well and allows me to enjoy fast inference on GPU by varying the batch_size.
However, on CPU the code above runs about 2x slower than a simpler version without a DataLoader:
all_logits2 = np.empty([0, 2])
for sent in sents:
    input_ids = torch.tensor(tokenizer.encode(sent,
                                              add_special_tokens=True,
                                              return_attention_mask=False,
                                              padding='longest',
                                              truncation=True,
                                              max_length=256)).unsqueeze(0).to(device)  # Batch size 1
    labels = torch.tensor([1]).unsqueeze(0).to(device)  # Batch size 1
    outputs = model(input_ids, labels=labels)
    loss, logits = outputs[:2]
    all_logits2 = np.vstack([all_logits2, torch.softmax(logits, dim=1).detach().cpu().numpy()])
Based on my crude benchmarks, I should stick with the “DataLoader” version above if I want to run faster on GPU by playing with the batch size, and with the “DataLoader-free” version if I am running on CPU.
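(My benchmarks are just wall-clock timings roughly of this form; the notebook linked below has the actual cells:)

import time

# Crude wall-clock comparison (sketch): time each inference loop end to end.
start = time.perf_counter()
# ... run the DataLoader loop above ...
print(f"DataLoader version: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
# ... run the per-sentence loop above ...
print(f"DataLoader-free version: {time.perf_counter() - start:.1f}s")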
The behavior reproduces in this Colab notebook (run all cells on a CPU runtime first, then compare on a GPU runtime): https://colab.research.google.com/gist/davidefiocco/4d738ef9d3b1976187086ea31ca25ed2/batch-bert.ipynb
Am I missing something obvious? Can I tweak my snippet using DataLoaders so that it doesn’t result in a speed penalty when running on CPUs?
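For instance, I wonder whether padding each batch only to its own longest sentence (instead of padding everything to the global longest up front) would close the gap; something like this untested sketch, with a collate_fn that tokenizes each batch on the fly:

# Untested sketch: tokenize inside a collate_fn so each batch is padded only
# to the longest sentence *in that batch*, not in the whole dataset.
def collate_tokenize(batch_sents):
    enc = tokenizer(batch_sents,
                    add_special_tokens=True,
                    padding=True,        # pad to the longest sentence in this batch
                    truncation=True,
                    max_length=256,
                    return_tensors='pt')
    return enc['input_ids'], enc['attention_mask']

# A plain list of strings works as a Dataset here.
dataloader_val2 = DataLoader(sents, batch_size=batch_size, collate_fn=collate_tokenize)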
Thanks!