How to solve the bottleneck of transferring data from CPU to GPU

I'm running inference with a simple BERT model on a fixed sample of 1200 data points for a classification task.

These are the total times taken to process the entire dataset under different experiment scenarios.

Note: each time listed is the sum over all individual batches.
A. Using DataLoader with pin_memory=True (batch_size=32)

  1. Tokenization Time: 0:00:00.983126
  2. CPU→GPU Data Transfer Time: 0:00:00.338249
  3. Forward Pass Time: 0:00:01.528260

B. Using DataLoader with pin_memory=True (batch_size=256)
Total Time Taken with batch_size 256: 0:00:02.825832

  1. Tokenization Time: 0:00:01.069770
  2. CPU→GPU Data Transfer Time: 0:00:01.219674
  3. Forward Pass Time: 0:00:00.464684

C. Using DataLoader with pin_memory=True (batch_size=1024)
Total Time Taken with batch_size 1024: 0:00:03.747645

  1. Tokenization Time: 0:00:00.930315
  2. CPU→GPU Data Transfer Time: 0:00:02.331700
  3. Forward Pass Time: 0:00:00.069030

I don't understand why the total data transfer time (inputs.to(device)) is higher for a larger batch size
(i.e., why moving 1024 samples to the GPU in a single transfer takes longer than moving the same 1024 samples as 32 transfers of batch size 32).

How do I solve this CPU bottleneck?
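Is the usual fix something like the sketch below: tokenizing inside the DataLoader's collate_fn so that pin_memory=True actually applies to the returned tensors, and then copying with non_blocking=True? (Rough sketch, not my actual code; `texts`, `tokenizer`, and `model` are placeholders.)

```
# Rough sketch: tokenize in the collate_fn so pin_memory=True pins the
# tensors the loader returns, then do an asynchronous copy to the GPU.
# `texts`, `tokenizer`, `model` are placeholders for my actual objects.
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda")

def collate(batch_texts):
    # Runs in the DataLoader workers, so tokenization can overlap with GPU work
    return tokenizer(batch_texts, return_tensors="pt", padding=True,
                     truncation=True, max_length=128)

loader = DataLoader(texts, batch_size=256, collate_fn=collate,
                    num_workers=2, pin_memory=True)

all_logits = []
with torch.no_grad():
    for batch in loader:
        # non_blocking=True only helps because the source tensors are pinned
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
        all_logits.append(model(**batch).logits)
```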

Some info on the inference setup code:
```

from datetime import datetime, timedelta

import torch
from tqdm import tqdm

tokenization_time, datatransfer_time, inference_time = [], [], []
outputs = []
results = {}

for batch_inputs in tqdm(inference_loader):
    # Tokenize the raw text batch on the CPU
    s = datetime.now()
    inputs = tokenizer(batch_inputs, return_tensors='pt', padding=True,
                       truncation=True, max_length=128)
    e = datetime.now()
    tokenization_time.append(e - s)

    # Copy the tokenized tensors to the GPU
    s = datetime.now()
    inputs = inputs.to(device)
    e = datetime.now()
    datatransfer_time.append(e - s)

    # Forward pass
    s = datetime.now()
    logits = model(**inputs).logits
    e = datetime.now()
    inference_time.append(e - s)

    outputs.append(logits)

# Collect predictions after the loop
outputs = torch.cat(outputs, dim=0)
print(outputs.shape)
results['labels'] = [id2label[indx] for indx in outputs.argmax(dim=-1).tolist()]
results['scores'] = outputs.softmax(dim=-1).tolist()

print("Tokenization Time: ", sum(tokenization_time, timedelta(0)))
print("DataLoad Time: ", sum(datatransfer_time, timedelta(0)))
print("Inference Time: ", sum(inference_time, timedelta(0)))

```
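One thing I'm also unsure about: CUDA calls are asynchronous, so timing with datetime.now() alone may attribute time to the wrong stage (the forward pass can look almost free while the next blocking call absorbs the wait). A sketch of how the same stages could be timed with torch.cuda.synchronize() before each timestamp, reusing the variables from the loop above:

```
# Sketch: synchronize before each timestamp so queued GPU work is not
# attributed to the wrong stage. Reuses `inputs`, `device`, `model` and the
# timing lists from the loop above.
import torch
from datetime import datetime

def now_synced():
    torch.cuda.synchronize()  # wait for all GPU work queued so far
    return datetime.now()

s = now_synced()
inputs = inputs.to(device)
e = now_synced()
datatransfer_time.append(e - s)

s = now_synced()
logits = model(**inputs).logits
e = now_synced()
inference_time.append(e - s)
```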

This change worked around the issue (in Colab GPU).

Thanks a lot for the explanation!


Just a bit curious :) .. the formatting of the README file looks like a typical AI-generated one, but the depth of explanation and the coherence of the reasoning in each part hold together so well that it's almost scary if this was AI-generated end to end.


Ha ha. I manually remove trailing afterthoughts (like "if you want*") and format/merge things in Python, plus I chat with the bot a few times beforehand to give it some background knowledge.

The interactions go like this: I fill in missing subjects, mention relevant library names for bugs I know about, provide my limited knowledge or context from nearby Markdown files, have it search online for similar cases, give feedback on code errors, and then ask again.

Beyond these basic techniques, it's seriously just plain GPT-5 Thinking used straight from the browser. :sweat_smile:

Rather than using the AI’s built-in knowledge, I’m essentially using it as a RAG system to synthesize vast search results. Probably the same approach would work with other models like Gemini.
