Moving tokenizer outputs to CUDA taking way too long

I’m making a batch predict function with a model I trained. The issue is that after creating inputs with the tokenizer, moving the inputs to cuda takes an extremely long time. About 95% of the prediction function time is spent on this, and 2.5% on the actual prediction, so I feel like I must be doing something wrong. Here is the function:

import datetime
import math
import os

import pandas as pd
import torch
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

import ipywidgets as widgets
from IPython.display import display

device = "cuda"  # assumed; defined outside this snippet in my actual script


class Predictor:
    def __init__(self, model_name, batch_size, max_input_length, binary_threshold):
        self.batch_size = batch_size
        self.max_input_length = max_input_length

        self.model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
        self.model.eval()  # inference only, so disable dropout
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        self.binary_threshold = binary_threshold
        
    def predict(self, csv_file_path):
        
        start_time=datetime.datetime.now()
        
        print(os.path.basename(csv_file_path))
        
        df = pd.read_csv(csv_file_path)
        print(len(df))

        dataset = Dataset.from_pandas(df)
        
        
        num_batches = math.ceil(len(dataset)/self.batch_size)
        
        my_pb = widgets.IntProgress(
            value=0,
            min=0,
            max=num_batches,
            description='Loading:',
            bar_style= 'success',
            style={'bar_color': 'green'},
            orientation='horizontal'
        )
        
        display(my_pb)
        
        # Do the predictions, one batch at a time
        
        df['yes_probability'] = pd.Series(dtype='float')
        
        for batch in range(num_batches):

            # Tokenize this batch of texts
            batch_texts = df.loc[self.batch_size*batch:self.batch_size*(batch+1)-1, 'body'].tolist()
            inputs = self.tokenizer(batch_texts, return_tensors="pt", truncation=True, padding=True, max_length=self.max_input_length)
            inputs = inputs.to('cuda')

            # Make predictions        
            with torch.no_grad():
                logits = self.model(**inputs).logits

            # Get YES probabilities and add to original dataframe
            yes_probabilities = logits.softmax(dim=1)[:,1].to('cpu', non_blocking=True).tolist()
            df.loc[self.batch_size*batch:self.batch_size*(batch+1)-1, 'yes_probability'] = yes_probabilities
                        
            
            # Update progress bar
            my_pb.value = batch + 1
            
        output_df = df[df['yes_probability'] > self.binary_threshold]
        
        end_time = datetime.datetime.now()
        total_time = (end_time - start_time).total_seconds()
                
        time_per_item = round(total_time / len(df), 4)
        
        print(f"{total_time} total seconds ({time_per_item} seconds per item)")
        
        print(f"{len(output_df)}/{len(df)} predicted as YES ({round(100*len(output_df)/len(df), 2)}%)")
        
        print()
        print()
        
        output_filename = os.path.basename(csv_file_path).split(os.extsep)[0] + ".csv"
        
        output_df.to_csv(f"/datasets/s3/simple_classifier_outputs/{output_filename}")

The line inputs = inputs.to('cuda') is what takes up 95% of the time, according to the line_profiler library.

I'm not sure whether the suggestions below will directly reduce the .to('cuda') time, but they are good data-transfer optimisations to have anyway.

You should consider using a PyTorch DataLoader: it lets you use multiple workers to load data, you can set pin_memory for faster CPU → GPU transfers, and you can use a data collator to tokenise your data (which I believe also benefits from the multiple workers). There's a rough sketch below.

Datasets & DataLoaders — PyTorch Tutorials 2.2.1+cu121 documentation

As for choosing a worker count, I have heard it's good to use 4 per GPU; more or fewer may hurt performance.
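
Roughly, the pieces fit together like the sketch below. This is just an illustration, not your exact setup: the collate_and_tokenize helper is made up, I'm reusing the 'body' column name from your code, and I've dropped the self. prefixes for brevity.

from torch.utils.data import DataLoader

# Hypothetical collate function: the DataLoader workers call this, so
# tokenisation runs in parallel on the CPU instead of in the main loop.
def collate_and_tokenize(rows):
    texts = [row["body"] for row in rows]  # each row is a dict from the HF Dataset
    return tokenizer(texts, return_tensors="pt", truncation=True,
                     padding=True, max_length=max_input_length)

dataloader = DataLoader(
    dataset,                    # the Dataset.from_pandas(df) object
    batch_size=batch_size,
    num_workers=4,              # parallel loading + tokenisation
    pin_memory=True,            # page-locked buffers for faster CPU -> GPU copies
    collate_fn=collate_and_tokenize,
)

for inputs in dataloader:
    inputs = {k: v.to("cuda", non_blocking=True) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits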

Thanks, I’ll learn about those and try to work them into the function.

Note that all of the above optimisations are features of the DataLoader that can be enabled, so you only need to implement the DataLoader and turn the other features on when you pass your dataset to it. If you need any support, let me know.


I experimented some more, and I think it's just a shortcoming of the line_profiler library in dealing with GPU operations. I completely separated the tokenizing/moving-to-CUDA part from the prediction part. When I run those two parts separately, line_profiler shows to('cuda') as almost instant, but model(**inputs) is about 50 times slower than when I had them together. So I guess the prediction was taking all that time all along: CUDA calls run asynchronously, so the elapsed GPU time gets attributed to whichever line first has to wait for the result, which made it show up on the wrong line.
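
For anyone else hitting this, one way to confirm where the time actually goes is to synchronise explicitly around the forward pass (a minimal sketch, reusing the model and inputs names from the code above):

import time
import torch

torch.cuda.synchronize()             # make sure earlier GPU work (e.g. the H2D copy) is done
start = time.perf_counter()

with torch.no_grad():
    logits = model(**inputs).logits  # model/inputs as in the code above

torch.cuda.synchronize()             # wait for the forward pass itself to finish
print(f"forward pass: {time.perf_counter() - start:.3f}s")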

You could try enabling FP16 (mixed precision) for inference: it reduces VRAM usage and increases throughput. Combined with the dataloader it should have a significant impact on performance.

CUDA Automatic Mixed Precision examples — PyTorch 2.2 documentation
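
For inference it can be as simple as wrapping the forward pass in an autocast context (a sketch using the model/inputs names from your code, not a drop-in snippet):

import torch

# Run the forward pass in float16 where it is safe to do so.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    logits = model(**inputs).logits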

I’m using a dataloader now. Before, the larger the inference dataset, the longer each batch would take, so I think the old way of retrieving batches was wasting time. Now the per-batch time stays constant, which is nice. Here’s my updated code:


from torch.utils.data import DataLoader

df = pd.read_csv(csv_file_path)

dataset = Dataset.from_pandas(df)

dataloader = DataLoader(dataset, batch_size=self.batch_size)

# Create a column where the predictions will go
df['yes_probability'] = pd.Series(dtype='float')

for idx, batch in enumerate(dataloader):

    # Tokenize this batch
    inputs = self.tokenizer(batch['body'], return_tensors="pt", truncation=True, padding=True, max_length=self.max_input_length)
    inputs = inputs.to('cuda')

    # Make predictions
    with torch.no_grad():
        logits = self.model(**inputs).logits

    # Get YES probabilities and add to the original dataframe
    yes_probabilities = logits.softmax(dim=1)[:, 1].to('cpu').tolist()
    df.loc[self.batch_size*idx:self.batch_size*(idx+1)-1, 'yes_probability'] = yes_probabilities

I do have fp16 enabled during training.

I’ll still have to look into data collators and the pin_memory setting.

But since most of the time currently seems to be spent on predictions, I’m about to try a SetFit model, which I just learned about, to see if I can get the same or better predictions with a smaller, faster model.
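
For reference, basic SetFit inference looks roughly like this (the checkpoint name below is just a placeholder for whatever I end up fine-tuning):

from setfit import SetFitModel

# Load a fine-tuned SetFit checkpoint (placeholder name) and classify a batch of texts.
model = SetFitModel.from_pretrained("my-org/my-finetuned-setfit-model")
preds = model.predict(["example text one", "example text two"])
print(preds)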

I think you are heading in the right direction. By the way, for pinning memory you can just pass it as an argument to the DataLoader:

DataLoader(dataset, pin_memory=True, num_workers=4, collate_fn=collate_fn)