Moving tokenizer outputs to CUDA taking way too long

I’m making a batch predict function with a model I trained. The issue is that after creating inputs with the tokenizer, moving the inputs to cuda takes an extremely long time. About 95% of the prediction function time is spent on this, and 2.5% on the actual prediction, so I feel like I must be doing something wrong. Here is the function:

import datetime
import math
import os

import pandas as pd
import torch
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

import ipywidgets as widgets
from IPython.display import display

device = "cuda"  # assumed; defined outside this snippet in my actual script


class Predictor:
    def __init__(self, model_name, batch_size, max_input_length, binary_threshold):
        self.batch_size = batch_size
        self.max_input_length = max_input_length

        self.model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
        self.model.eval()  # inference only, so disable dropout
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        self.binary_threshold = binary_threshold
        
    def predict(self, csv_file_path):
        
        start_time=datetime.datetime.now()
        
        print(os.path.basename(csv_file_path))
        
        df = pd.read_csv(csv_file_path)
        print(len(df))

        dataset = Dataset.from_pandas(df)
        
        
        num_batches = math.ceil(len(dataset)/self.batch_size)
        
        my_pb = widgets.IntProgress(
            value=0,
            min=0,
            max=num_batches,
            description='Loading:',
            bar_style= 'success',
            style={'bar_color': 'green'},
            orientation='horizontal'
        )
        
        display(my_pb)
        
        # Do the predictions, one batch at a time
        
        df['yes_probability'] = pd.Series(dtype='float')
        
        for batch in range(num_batches):

            # Tokenize this batch of texts
            batch_texts = df.loc[self.batch_size*batch:self.batch_size*(batch+1)-1, 'body'].tolist()
            inputs = self.tokenizer(batch_texts, return_tensors="pt", truncation=True, padding=True, max_length=self.max_input_length)
            inputs = inputs.to('cuda')

            # Make predictions        
            with torch.no_grad():
                logits = self.model(**inputs).logits

            # Get YES probabilities and add to original dataframe
            yes_probabilities = logits.softmax(dim=1)[:,1].to('cpu', non_blocking=True).tolist()
            df.loc[self.batch_size*batch:self.batch_size*(batch+1)-1, 'yes_probability'] = yes_probabilities
                        
            
            # Update progress bar
            my_pb.value = batch + 1
            
        output_df = df[df['yes_probability'] > self.binary_threshold]
        
        end_time = datetime.datetime.now()
        total_time = (end_time - start_time).total_seconds()
                
        time_per_item = round(total_time / len(df), 4)
        
        print(f"{total_time} total seconds ({time_per_item} seconds per item)")
        
        print(f"{len(output_df)}/{len(df)} predicted as YES ({round(100*len(output_df)/len(df), 2)}%)")
        
        print()
        print()
        
        output_filename = os.path.basename(csv_file_path).split(os.extsep)[0] + ".csv"
        
        output_df.to_csv(f"/datasets/s3/simple_classifier_outputs/{output_filename}")

The line inputs = inputs.to('cuda') is what takes up 95% of the time, according to the line_profiler library.

I'm not sure whether the suggestions below will directly reduce the .to('cuda') time, but they are good data-transfer optimisations to have anyway.

You should consider using a PyTorch DataLoader: it lets you use multiple workers to load data, you can set pin_memory for faster CPU → GPU transfers, and you can use a data collator to tokenise your data (which I believe also benefits from the multiple workers). There's a rough sketch below.

Datasets & DataLoaders — PyTorch Tutorials 2.2.1+cu121 documentation

As for choosing a worker count, I have heard it's good to use 4 per GPU; more or fewer may hurt performance.
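
Roughly, the pieces fit together like the sketch below. This is just an illustration, not your exact setup: the collate_and_tokenize helper is made up, I'm reusing the 'body' column name from your code, and I've dropped the self. prefixes for brevity.

from torch.utils.data import DataLoader

# Hypothetical collate function: the DataLoader workers call this, so
# tokenisation runs in parallel on the CPU instead of in the main loop.
def collate_and_tokenize(rows):
    texts = [row["body"] for row in rows]  # each row is a dict from the HF Dataset
    return tokenizer(texts, return_tensors="pt", truncation=True,
                     padding=True, max_length=max_input_length)

dataloader = DataLoader(
    dataset,                    # the Dataset.from_pandas(df) object
    batch_size=batch_size,
    num_workers=4,              # parallel loading + tokenisation
    pin_memory=True,            # page-locked buffers for faster CPU -> GPU copies
    collate_fn=collate_and_tokenize,
)

for inputs in dataloader:
    inputs = {k: v.to("cuda", non_blocking=True) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits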

Thanks, I’ll learn about those and try to work them into the function.

Note that all of the above optimisations are features of the DataLoader that can be enabled, so you only need to implement the DataLoader and turn the other features on when you pass your dataset to it. If you need any support, let me know.


I experimented some more, and I think it's just a shortcoming of the line_profiler library in dealing with GPU operations. I completely separated the tokenizing/moving-to-CUDA part from the prediction part. When I run those two parts separately, line_profiler shows to('cuda') as almost instant, but model(**inputs) is about 50 times slower than when I had them together. So I guess the prediction was taking all that time all along: CUDA calls run asynchronously, so the elapsed GPU time gets attributed to whichever line first has to wait for the result, which made it show up on the wrong line.
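
For anyone else hitting this, one way to confirm where the time actually goes is to synchronise explicitly around the forward pass (a minimal sketch, reusing the model and inputs names from the code above):

import time
import torch

torch.cuda.synchronize()             # make sure earlier GPU work (e.g. the H2D copy) is done
start = time.perf_counter()

with torch.no_grad():
    logits = model(**inputs).logits  # model/inputs as in the code above

torch.cuda.synchronize()             # wait for the forward pass itself to finish
print(f"forward pass: {time.perf_counter() - start:.3f}s")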

You could try enabling FP16 (mixed precision) for inference: it reduces VRAM usage and increases throughput. Combined with the dataloader it should have a significant impact on performance.

CUDA Automatic Mixed Precision examples — PyTorch 2.2 documentation
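
For inference it can be as simple as wrapping the forward pass in an autocast context (a sketch using the model/inputs names from your code, not a drop-in snippet):

import torch

# Run the forward pass in float16 where it is safe to do so.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    logits = model(**inputs).logits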

I’m using a dataloader now. Before, the larger the inference dataset, the longer each batch would take, so I think the old way of retrieving batches was wasting time. Now the per-batch time stays constant, which is nice. Here’s my updated code:


from torch.utils.data import DataLoader

df = pd.read_csv(csv_file_path)

dataset = Dataset.from_pandas(df)

dataloader = DataLoader(dataset, batch_size=self.batch_size)

# Create a column where the predictions will go
df['yes_probability'] = pd.Series(dtype='float')

for idx, batch in enumerate(dataloader):

    # Tokenize this batch
    inputs = self.tokenizer(batch['body'], return_tensors="pt", truncation=True, padding=True, max_length=self.max_input_length)
    inputs = inputs.to('cuda')

    # Make predictions
    with torch.no_grad():
        logits = self.model(**inputs).logits

    # Get YES probabilities and add to the original dataframe
    yes_probabilities = logits.softmax(dim=1)[:, 1].to('cpu').tolist()
    df.loc[self.batch_size*idx:self.batch_size*(idx+1)-1, 'yes_probability'] = yes_probabilities

I do have fp16 enabled during training.

I’ll still have to look into data collators and the pin_memory setting.

But since most of the time currently seems to be spent on predictions, I’m about to try a SetFit model, which I just learned about, to see if I can get the same or better predictions with a smaller, faster model.
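
For reference, basic SetFit inference looks roughly like this (the checkpoint name below is just a placeholder for whatever I end up fine-tuning):

from setfit import SetFitModel

# Load a fine-tuned SetFit checkpoint (placeholder name) and classify a batch of texts.
model = SetFitModel.from_pretrained("my-org/my-finetuned-setfit-model")
preds = model.predict(["example text one", "example text two"])
print(preds)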

I think you are heading in the right direction. By the way, for pinning memory you can just pass it as an argument to the DataLoader:

DataLoader(dataset, pin_memory=True, num_workers=4, collate_fn=collate_fn)