Hi everyone,
I’m pretty new to this. I’m trying to train a transformer model on a GPU using transformers.Trainer.
I’m doing my prototyping at home on a Windows 10 machine with a 4-core CPU and a GTX 1060. I have my data, model, and Trainer all set up, and my dataset is an instance of torch.utils.data.Dataset. Based on what I see in Task Manager, it looks like Trainer might not be prefetching data to keep the GPU busy at all times. Here’s what I see during training:
As you can see, GPU usage maxes out around 55% and regularly drops back to 0% during training.
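For reference, here’s roughly what my setup looks like. It’s simplified: the actual model, dataset contents, and hyperparameters differ, and the names below are just placeholders.

```python
import torch
from torch.utils.data import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


class MyDataset(Dataset):
    """Simplified stand-in for my real torch.utils.data.Dataset."""

    def __init__(self, texts, labels, tokenizer, max_length=128):
        # Tokenize everything up front so __getitem__ just indexes tensors.
        self.encodings = tokenizer(
            texts,
            truncation=True,
            padding="max_length",
            max_length=max_length,
            return_tensors="pt",
        )
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Dummy data here; my real data is loaded from disk.
texts = ["an example sentence"] * 32
labels = [0] * 32
train_dataset = MyDataset(texts, labels, tokenizer)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    # I left dataloader_num_workers at its default.
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```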
Iterating over my entire dataset on its own takes roughly a tenth of the time Trainer needs for one training epoch, so I don’t think data loading itself is the bottleneck.
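This is roughly how I timed that comparison (again simplified; `train_dataset` and the batch size are the same as in the sketch above):

```python
import time

from torch.utils.data import DataLoader

# Do one full pass over the dataset with the same batch size Trainer uses,
# without any forward/backward pass, and compare the wall-clock time
# against the time Trainer takes for one epoch.
loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

start = time.perf_counter()
for batch in loader:
    pass  # just pull batches, no model work
elapsed = time.perf_counter() - start
print(f"One full pass over the data: {elapsed:.1f}s")
```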
So, any ideas why I’m seeing this kind of GPU behavior? Is Trainer not prefetching the data for each training step, or is there some other issue with my code?