How to solve the bottleneck of transferring data from CPU to GPU

I'm running inference with a simple BERT model on a fixed sample of 1200 data points for a classification task.

These are the total times taken to process the entire dataset under different experiment scenarios.

Note: each time listed is the sum over all individual batches.
A. Using DataLoader with pin_memory=True (batch_size=32)

  1. Tokenization Time: 0:00:00.983126
  2. CPU→GPU Data Transfer Time: 0:00:00.338249
  3. Forward Pass Time: 0:00:01.528260

B. Using DataLoader with pin_memory=True (batch_size=256)
Total Time Taken with batch_size 256: 0:00:02.825832

  1. Tokenization Time: 0:00:01.069770
  2. CPU→GPU Data Transfer Time: 0:00:01.219674
  3. Forward Pass Time: 0:00:00.464684

C. Using DataLoader with pin_memory=True (batch_size=1024)
Total Time Taken with batch_size 1024: 0:00:03.747645

  1. Tokenization Time: 0:00:00.930315
  2. CPU→GPU Data Transfer Time: 0:00:02.331700
  3. Forward Pass Time: 0:00:00.069030

I don't understand why the total data transfer time (inputs.to(device)) is higher for a larger batch size
(i.e., why moving 1024 samples to the GPU in a single transfer takes longer than moving the same 1024 samples as 32 transfers of batch size 32).

How do I solve this CPU bottleneck?
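Is the usual fix something like the sketch below: tokenizing inside the DataLoader's collate_fn so that pin_memory=True actually applies to the returned tensors, and then copying with non_blocking=True? (Rough sketch, not my actual code; `texts`, `tokenizer`, and `model` are placeholders.)

```
# Rough sketch: tokenize in the collate_fn so pin_memory=True pins the
# tensors the loader returns, then do an asynchronous copy to the GPU.
# `texts`, `tokenizer`, `model` are placeholders for my actual objects.
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda")

def collate(batch_texts):
    # Runs in the DataLoader workers, so tokenization can overlap with GPU work
    return tokenizer(batch_texts, return_tensors="pt", padding=True,
                     truncation=True, max_length=128)

loader = DataLoader(texts, batch_size=256, collate_fn=collate,
                    num_workers=2, pin_memory=True)

all_logits = []
with torch.no_grad():
    for batch in loader:
        # non_blocking=True only helps because the source tensors are pinned
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
        all_logits.append(model(**batch).logits)
```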

Some info on the inference setup code:
```

from datetime import datetime, timedelta

import torch
from tqdm import tqdm

tokenization_time, datatransfer_time, inference_time = [], [], []
outputs = []
results = {}

for batch_inputs in tqdm(inference_loader):
    # Tokenize the raw text batch on the CPU
    s = datetime.now()
    inputs = tokenizer(batch_inputs, return_tensors='pt', padding=True,
                       truncation=True, max_length=128)
    e = datetime.now()
    tokenization_time.append(e - s)

    # Copy the tokenized tensors to the GPU
    s = datetime.now()
    inputs = inputs.to(device)
    e = datetime.now()
    datatransfer_time.append(e - s)

    # Forward pass
    s = datetime.now()
    logits = model(**inputs).logits
    e = datetime.now()
    inference_time.append(e - s)

    outputs.append(logits)

# Collect predictions after the loop
outputs = torch.cat(outputs, dim=0)
print(outputs.shape)
results['labels'] = [id2label[indx] for indx in outputs.argmax(dim=-1).tolist()]
results['scores'] = outputs.softmax(dim=-1).tolist()

print("Tokenization Time: ", sum(tokenization_time, timedelta(0)))
print("DataLoad Time: ", sum(datatransfer_time, timedelta(0)))
print("Inference Time: ", sum(inference_time, timedelta(0)))

```
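One thing I'm also unsure about: CUDA calls are asynchronous, so timing with datetime.now() alone may attribute time to the wrong stage (the forward pass can look almost free while the next blocking call absorbs the wait). A sketch of how the same stages could be timed with torch.cuda.synchronize() before each timestamp, reusing the variables from the loop above:

```
# Sketch: synchronize before each timestamp so queued GPU work is not
# attributed to the wrong stage. Reuses `inputs`, `device`, `model` and the
# timing lists from the loop above.
import torch
from datetime import datetime

def now_synced():
    torch.cuda.synchronize()  # wait for all GPU work queued so far
    return datetime.now()

s = now_synced()
inputs = inputs.to(device)
e = now_synced()
datatransfer_time.append(e - s)

s = now_synced()
logits = model(**inputs).logits
e = now_synced()
inference_time.append(e - s)
```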

This change worked around the issue (in Colab GPU).

Thanks a lot for the explanation!


Just a bit curious :) .. the formatting of the README file looks like a typical AI-generated one, but the depth of explanation and the coherence of the reasoning in each part hold together so well that it's almost scary if this was AI-generated end to end.


Ha ha. I manually remove trailing afterthoughts (like "if you want*") and format/merge things in Python, plus I chat with the bot a few times beforehand to give it some background knowledge.

The interactions go like this: I fill in missing subjects, mention relevant library names for bugs I know about, provide my limited knowledge or context from nearby Markdown files, have it search online for similar cases, give feedback on code errors, and then ask again.

Beyond these basic techniques, it's seriously just plain GPT-5 Thinking used straight from the browser. :sweat_smile:

Rather than using the AI’s built-in knowledge, I’m essentially using it as a RAG system to synthesize vast search results. Probably the same approach would work with other models like Gemini.
