RoBERTa training low GPU utilization

I’m trying to train RoBERTa from scratch on a proprietary dataset using the language-modeling script from the HF repo. When I run the training on a machine with 32 cores and 8x V100, the GPUs are not at 100% utilization all the time, and it looks like there is a bottleneck in the transfer between CPU and GPU. Even when I set the number of workers in the DataLoader to 32, performance does not change at all.
My batch size is 8 (the maximum I could fit into a 16 GB V100 on Google Cloud), and all examples have 512 tokens.

How can I improve GPU usage? Are there any additional parameters that need to be configured to utilize the GPUs better?
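For reference, these are the DataLoader knobs that usually affect CPU-to-GPU transfer the most. A minimal sketch (the dataset class here is a hypothetical stand-in for a pre-tokenized corpus, not the actual script):

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical stand-in for a pre-tokenized dataset: each item is a
# fixed-length block of token ids, as in MLM pre-training.
class TokenBlockDataset(Dataset):
    def __init__(self, num_examples=64, seq_len=512):
        self.data = torch.randint(0, 32768, (num_examples, seq_len))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

loader = DataLoader(
    TokenBlockDataset(),
    batch_size=8,
    num_workers=2,            # a few workers per GPU is usually enough
    pin_memory=True,          # page-locked host memory speeds up host-to-device copies
    persistent_workers=True,  # avoid respawning workers every epoch
    prefetch_factor=2,        # batches each worker keeps pre-loaded
)

batch = next(iter(loader))
print(batch.shape)  # torch.Size([8, 512])
```

Note that with DistributedDataParallel the worker count is per process, so 32 workers per GPU on a 32-core machine will oversubscribe the CPUs rather than help.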

I’ve also captured a few seconds of `watch nvidia-smi` output to give you the full picture: !AkfjsmHCRwTChtVDsCiowniO7rMLSA?e=zjNAkF

I’m using the following model parameters:

```json
{
  "architectures": [
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 32768
}
```
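For reference, the same configuration can be built programmatically with `RobertaConfig` (a sketch; the truncated `architectures` entry from the post is omitted):

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Rebuild the configuration from the post above; values not listed
# (dropout, initializer range, etc.) fall back to RobertaConfig defaults,
# which happen to match the posted ones.
config = RobertaConfig(
    vocab_size=32768,
    max_position_embeddings=514,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    type_vocab_size=1,
    bos_token_id=0,
    pad_token_id=1,
    eos_token_id=2,
)
model = RobertaForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```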

Bump - anyone?

Are you using the LineByLineTextDataset or not? The only thing I can think of is that you have an I/O bottleneck: the GPUs process data faster than it can be read and tokenized on the fly. You can try profiling your script to see where the issue lies.
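For the profiling step, `torch.profiler` is a reasonable starting point. A minimal sketch (the `Linear` model and random batch are just stand-ins for the real training objects; add `ProfilerActivity.CUDA` when running on GPU):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(512, 512)  # stand-in for the real model
batch = torch.randn(8, 512)        # stand-in for a real batch

# Wrap a few training steps in the profiler and check where time is
# spent: data loading / CPU ops vs. compute kernels.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(5):
        loss = model(batch).sum()
        loss.backward()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

If most of the time lands in DataLoader/collator ops rather than matmul kernels, the input pipeline is the bottleneck.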

Yes, I’m using LineByLineTextDataset, which pre-tokenizes the whole file up front. The only operations that happen before the input reaches the GPU are the ones in the data collator, which in this case applies dynamic masking for the MLM task.
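For context, the dynamic masking the collator applies per batch is roughly the 15% / 80-10-10 scheme from BERT/RoBERTa. A sketch (the mask id and vocab size here are illustrative, not the actual tokenizer's):

```python
import torch

def mask_tokens(inputs, mask_token_id=4, vocab_size=32768, mlm_prob=0.15):
    """Sketch of dynamic MLM masking, applied fresh on every batch."""
    labels = inputs.clone()
    # Select ~15% of positions to predict.
    masked = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~masked] = -100  # loss is computed only on masked positions

    # 80% of selected positions become the mask token...
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    inputs[replace] = mask_token_id
    # ...10% become a random token, and the remaining 10% stay unchanged.
    random_tok = (
        torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replace
    )
    inputs[random_tok] = torch.randint(vocab_size, labels.shape)[random_tok]
    return inputs, labels

inputs, labels = mask_tokens(torch.randint(5, 32768, (8, 512)))
```

These are a handful of cheap tensor ops per batch, so the collator itself is unlikely to starve eight V100s.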

In general, should the GPU utilization be 100% while using this script?

No, not necessarily. Think of it this way: if you are just browsing the internet, your computer is not using 100% CPU either, right? But everything still works flawlessly. There is simply nothing more to do, so the device doesn’t need to work harder. To give you an idea, I am training a model on a single GPU and it is holding steady at around 60% CUDA usage. That is fine. In your case, you will likely see more fluctuations because it is a multi-GPU set-up with DDP, where GPUs have to wait for each other from time to time. That is normal.

As long as the performance is what you would expect, you are good to go.
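One way to sanity-check that is to measure raw throughput in samples/sec. A rough sketch (stand-in model and batch; on GPU you would call `torch.cuda.synchronize()` before stopping the timer so pending kernels are counted):

```python
import time
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the real model
batch = torch.randn(8, 512)        # stand-in batch: 8 examples of 512 features

# Time a handful of forward/backward steps and report samples per second.
steps = 20
start = time.perf_counter()
for _ in range(steps):
    model(batch).sum().backward()
elapsed = time.perf_counter() - start

throughput = steps * batch.shape[0] / elapsed
print(f"{throughput:.1f} samples/sec")
```

If that number matches what the hardware should deliver for your model, sub-100% utilization is nothing to worry about.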


Seems reasonable, thanks @BramVanroy

Have you figured out how to increase GPU utilization?