Make BERT inference faster

Hey everyone!

I’m currently using gbert from Hugging Face to compute sentence similarity.
The dataset has nearly 3M sentences, and the encoding step is taking too long.

for sentence in list(data_dict.values()):
    tokens = {'input_ids': [], 'attention_mask': []}
    new_tokens = tokenizer.encode_plus(sentence, max_length=512,
                                       truncation=True, padding='max_length',
                                       return_tensors='pt',
                                       return_attention_mask=True)
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

    # reformat list of tensors into single tensor
    tokens['input_ids'] = torch.stack(tokens['input_ids'])
    tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

    outputs = model(**tokens) # takes too long
    embeddings = outputs[0]

Can someone advise me how to speed up this process? Is it possible to run

outputs = model(**tokens)

on a GPU? Would converting the model to ONNX help?

Thank you!

Hi,

Looking at your code, you can already make it faster in two ways: (1) by batching the sentences and (2) by using a GPU, indeed.

Deep learning models are typically trained on batches of examples, so you can also run them on batches at inference time. The tokenizer also supports preparing several examples at once.

Here’s a code example:

from transformers import BertTokenizer, BertForSequenceClassification
import torch

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

sentences = ["I like this movie a lot", "This movie is super bad"]
encoding = tokenizer(sentences, max_length=512, truncation=True, padding='max_length', return_tensors='pt')

# set device to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# put model on GPU (is done in-place)
model.to(device)

# put data on GPU
for k,v in encoding.items():
     encoding[k] = v.to(device)

# forward pass (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(**encoding)

# get predicted class indices
predicted_class_indices = outputs.logits.argmax(-1).tolist()
# turn into actual class names
predicted_classes = [model.config.id2label[label] for label in predicted_class_indices]
print(predicted_classes)

If you want to speed it up even more, then you can indeed look at converting your trained model to ONNX.

Thank you very much!

Switching to GPU already decreased the runtime.
Could you also point me to how to provide batches to the model for inference instead of a list of sentences?

Another thing you can do is sort all the input sentences by length and then batch them. For example, say you have 10 sentences that are 10 words long and 10 sentences that are 30 words long, and you use a batch size of 10: the first batch then contains only the 10-word sentences, and the next batch contains the longer ones.

I believe this could hurt accuracy (actually, I am pretty sure it would), but it is excellent for production and leads to a speedup of around 30 percent.
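A minimal sketch of the length-sorting idea, in case it helps. It uses whitespace word counts as a cheap proxy for token counts (an assumption; real token counts come from the tokenizer), and the helper name length_sorted_batches is made up for illustration. The model call is replaced by a stand-in so the snippet runs on its own; the point is only how to batch by length and still get results back in the original order.

```python
from typing import Iterator, List, Sequence, Tuple


def length_sorted_batches(
    sentences: Sequence[str], batch_size: int
) -> Iterator[Tuple[List[int], List[str]]]:
    """Yield (original_indices, batch) pairs, grouping sentences of similar length."""
    # Sort positions by whitespace word count (a rough proxy for token count).
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [sentences[i] for i in idx]


sentences = ["a b c d e f", "a b", "a b c d e", "a", "a b c", "a b c d"]
results = [None] * len(sentences)

for idx, batch in length_sorted_batches(sentences, batch_size=2):
    # Stand-in for "tokenize the batch and run the model":
    # here each "output" is just the word count of the sentence.
    outputs = [len(s.split()) for s in batch]
    for i, out in zip(idx, outputs):
        results[i] = out  # write each result back at its original position

print(results)  # [6, 2, 5, 1, 3, 4]
```

Because every result is written back at its original index, downstream code sees the same ordering as the input list.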

You use tokenizer.batch_encode_plus for batches.

tokenizer.encode and tokenizer.encode_plus shouldn’t be used anymore, actually. You can just call the tokenizer, either on a single input or on a batch of inputs:

# single input
sentence = "this movie is very good"
encoding = tokenizer(sentence, max_length=512, truncation=True, padding='max_length', return_tensors='pt')

# batch of inputs
sentences = ["I like this movie a lot", "This movie is super bad"]
encoding = tokenizer(sentences, max_length=512, truncation=True, padding='max_length', return_tensors='pt')

If you want to further improve inference time, you can convert it to ONNX, as explained in the docs.
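With around 3M sentences you probably cannot pass the whole list to the tokenizer and model in one call, so the list has to be cut into chunks first. Here is a rough sketch of that loop. The names iter_batches, encode_all and fake_encode_batch are hypothetical helpers, not part of transformers, and the stand-in function only exists so the snippet runs without a model.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def encode_all(
    sentences: Sequence[str],
    encode_batch: Callable[[List[str]], list],
    batch_size: int = 32,
) -> list:
    """Run encode_batch on every chunk and collect the results in input order."""
    results = []
    for batch in iter_batches(sentences, batch_size):
        results.extend(encode_batch(batch))
    return results


# Hypothetical stand-in for "tokenize the batch and run the model".
def fake_encode_batch(batch: List[str]) -> List[int]:
    return [len(s) for s in batch]


sentences = [f"sentence {i}" for i in range(10)]
embeddings = encode_all(sentences, fake_encode_batch, batch_size=4)
print(len(embeddings))  # 10
```

In real use, encode_batch would call the tokenizer on the chunk, for example tokenizer(batch, padding=True, truncation=True, return_tensors='pt'), move the tensors to the device, and run the model under torch.no_grad(). Passing padding=True pads each chunk only to its own longest sentence rather than to 512 tokens.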

Many things have been said here, and all of them are true; however, I would like to weigh in on this:

  • Using a GPU will most likely give faster results, if you have one available

  • If you use a GPU, try to use a DataLoader and have the Dataset run the tokenization; this helps keep the GPU busy

  • Be careful about batching on real data: if there’s too much padding involved, it might actually decrease performance. (If you have no idea what’s in your data, I recommend batch_size=1 to avoid memory overflow and slower performance overall)

  • ONNX will improve speed on CPU and can become very competitive with GPU, but it’s unlikely to be more performant. (But it’s much easier to scale by adding more CPUs for compute)

  • Make sure you are using a fast tokenizer and not a slow one

  • Measure everything you do, and avoid overly simple test cases, as they might not accurately represent what is going to happen on real data. (Use something like the first 100 examples of real data over and over to get a better sense)

  • You can use pipeline to get what is probably a much easier-to-use API, which will take care of most of this if you pass it a Dataset.

from transformers import pipeline
from torch.utils.data import Dataset
import tqdm

data_dict = {i: "This is a test" if i % 2 else ("This is a longer test" * 20) for i in range(1000)}


model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

pipe = pipeline("text-classification", model=model_name, device=0)


class MyDataset(Dataset):
    def __init__(self, dic):
        self.dic = dic

    def __len__(self):
        return len(self.dic)

    def __getitem__(self, i):
        sentence = self.dic[i]
        return sentence


dataset = MyDataset(data_dict)


for outputs in tqdm.tqdm(pipe(dataset)):
    pass
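The data_dict above alternates short and long sentences on purpose, which is exactly the padding-heavy case mentioned in the list. As a rough way to see how much work goes into padding, here is a small sketch that assumes each batch is padded to its longest item. The lengths are hypothetical token counts (4 and 100), not values produced by a real tokenizer, and padding_fraction is a made-up helper name.

```python
from typing import Sequence


def padding_fraction(lengths: Sequence[int], batch_size: int) -> float:
    """Fraction of positions that are padding when each batch is padded to its longest item."""
    total_real = 0
    total_padded = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total_real += sum(batch)
        total_padded += max(batch) * len(batch)
    return 1.0 - total_real / total_padded if total_padded else 0.0


# Hypothetical token counts: short and long inputs alternating.
lengths = [4, 100] * 500

print(padding_fraction(lengths, batch_size=1))            # 0.0 (nothing to pad)
print(round(padding_fraction(lengths, batch_size=8), 2))  # 0.48
print(padding_fraction(sorted(lengths), batch_size=8) < 0.01)  # True
```

In this toy setup, naive batches of 8 spend almost half of their positions on padding, while the same batch size on length-sorted input wastes under one percent. Real numbers depend on your data, so it is worth running something like this on actual token counts.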

Here is the full low-level example you can use to tweak parameters if batching makes sense in your use case:

from transformers import AutoTokenizer, AutoModel
import torch
from torch.utils.data import Dataset, DataLoader
import tqdm

data_dict = {i: "This is a test" if i % 2 else ("This is a longer test" * 20) for i in range(1000)}


model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModel.from_pretrained(model_name)
# use the GPU if there is one, otherwise fall back to CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()


class MyDataset(Dataset):
    def __init__(self, tokenizer, dic):
        self.tokenizer = tokenizer
        self.dic = dic

    def __len__(self):
        return len(self.dic)

    def __getitem__(self, i):
        sentence = self.dic[i]
        # tokenize inside the Dataset so the DataLoader workers do this work
        tokens = self.tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
        return tokens


dataset = MyDataset(tokenizer, data_dict)


def collate_fn(batch):
    # drop the leading batch dimension of each item, then pad to the longest item in the batch
    return tokenizer.pad([{k: v.squeeze(0) for k, v in item.items()} for item in batch], return_tensors="pt")


dataloader = DataLoader(dataset, batch_size=1, collate_fn=collate_fn, num_workers=4)


with torch.no_grad():  # gradients are not needed for inference
    for tokens in tqdm.tqdm(dataloader):
        tokens = {k: v.to(device) for k, v in tokens.items()}
        outputs = model(**tokens)
        embeddings = outputs[0]  # last hidden state
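On the "measure everything" point from the list above, a minimal timing sketch might look like the following. The benchmark helper and fake_encode are hypothetical names; fake_encode stands in for whatever function encodes a list of sentences in your setup, so the snippet runs without a model.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def benchmark(
    fn: Callable[[Sequence[str]], object],
    sample: Sequence[str],
    repeats: int = 5,
    warmup: int = 1,
) -> Dict[str, float]:
    """Time fn(sample) several times and report per-example latency in seconds."""
    for _ in range(warmup):
        fn(sample)  # warm-up runs are not timed (caches, lazy initialisation)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(sample)
        timings.append(time.perf_counter() - start)
    return {
        "median_s_per_example": statistics.median(timings) / len(sample),
        "min_s_per_example": min(timings) / len(sample),
    }


# Hypothetical stand-in for a function that encodes a list of sentences.
def fake_encode(sentences: Sequence[str]) -> list:
    return [len(s.split()) for s in sentences]


sample = ["This is a test"] * 100  # in practice: the first 100 real examples
stats = benchmark(fake_encode, sample)
print(stats)
```

When timing a model on a GPU, keep in mind that CUDA calls are asynchronous, so calling torch.cuda.synchronize() before reading the timer gives more trustworthy numbers. Running the same harness with different batch sizes on the same 100 real examples is a simple way to compare settings.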