RoBERTa training low GPU utilization

I’m trying to train RoBERTa from scratch on a proprietary dataset, using the script from the HF repo (https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py). When I run the training on a machine with 32 cores and 8x V100, the GPUs are not at 100% utilization all the time, and it looks like the bottleneck is the transfer between CPU and GPU. Even when I set the number of workers in the DataLoader to 32, performance does not change at all.
My batch size is 8 (the maximum I could fit into a 16 GB V100 on Google Cloud), and all examples are 512 tokens long.

How can I improve GPU usage? Are there any additional parameters that need to be configured to utilize the GPUs better?

I’ve also captured a few seconds of watch nvidia-smi output to give you the full picture:
https://1drv.ms/v/s!AkfjsmHCRwTChtVDsCiowniO7rMLSA?e=zjNAkF
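In case the recording is hard to make out, this is roughly how I also sample the utilization as plain numbers (a minimal sketch; the specific nvidia-smi query fields are my own choice, not something the training script does):

import subprocess
import time

# Poll nvidia-smi once per second and print per-GPU utilization and memory use,
# so dips in utilization can be lined up with what the training loop is doing.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]

for _ in range(30):  # roughly 30 seconds of samples
    out = subprocess.run(QUERY, capture_output=True, text=True).stdout.strip()
    print(time.strftime("%H:%M:%S"), "|", out.replace("\n", " | "))
    time.sleep(1)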

I’m using the following model parameters:

{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 32768
}
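For completeness, this config is turned into an untrained model like this (a minimal sketch using the public transformers API; config.json is a placeholder for wherever the file above lives):

from transformers import RobertaConfig, RobertaForMaskedLM

# Build a randomly initialized RoBERTa-base-sized MLM model from the config above.
config = RobertaConfig.from_json_file("config.json")  # placeholder path
model = RobertaForMaskedLM(config)

print("parameters:", model.num_parameters())  # roughly 110M for this config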

Bump - anyone?

Are you using the LineByLineTextDataset or not? The only thing I can think of is that you have an IO bottleneck: the GPUs process data faster than it can be read and tokenized on the fly. You can try profiling your script to see where the issue lies.
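For example, something along these lines (a rough sketch with random token ids, so it only exercises the model side) splits the time spent on CPU versus CUDA for a few forward/backward passes:

import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Profile a handful of MLM forward/backward passes to see how much time is spent
# on the CPU versus on the GPU; real batches can be plugged in instead.
config = RobertaConfig(vocab_size=32768, max_position_embeddings=514, type_vocab_size=1)
model = RobertaForMaskedLM(config).cuda().train()

batch = torch.randint(0, config.vocab_size, (8, 512), device="cuda")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for _ in range(5):
        loss = model(input_ids=batch, labels=batch)[0]
        loss.backward()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))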

Yes, I’m using LineByLineTextDataset, which already pre-tokenizes the whole file at the very beginning. The only operations that happen before the input reaches the GPU are the ones in the data collator, which in this case applies dynamic masking for the MLM task.
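Concretely, the data path looks like this (a minimal sketch; the tokenizer and file paths are placeholders), and timing the loader on its own shows how many sequences per second the CPU side can produce:

import time
from torch.utils.data import DataLoader
from transformers import (DataCollatorForLanguageModeling, LineByLineTextDataset,
                          RobertaTokenizerFast)

tokenizer = RobertaTokenizerFast.from_pretrained("path/to/tokenizer")  # placeholder
# Reads and tokenizes the whole file up front.
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="path/to/train.txt",  # placeholder
                                block_size=512)
# Dynamic masking for the MLM objective happens here, on the CPU, for every batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
loader = DataLoader(dataset, batch_size=8, collate_fn=collator, num_workers=4)

n_batches, batch_size = 100, 8
start = time.time()
for i, _ in enumerate(loader):
    if i + 1 == n_batches:
        break
rate = n_batches * batch_size / (time.time() - start)
print(f"data pipeline alone: {rate:.1f} sequences/sec")

If that rate is comfortably above what the eight GPUs actually consume per second, the collator is not the problem.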

In general, should the GPU utilization be 100% while using this script?

No, not necessarily. Think of it this way: if you are just browsing the internet, your computer is not using 100% CPU either, right? Everything still works flawlessly; there is simply nothing more to do, so the device doesn’t need to work harder. To give you an idea, I am training a model on a single GPU and it is holding steady at around 60% CUDA usage, which is fine. In your case you will likely see more fluctuations because it is a multi-GPU set-up with DDP, where the GPUs have to wait for each other from time to time. That is normal.

As long as the performance is what you would expect, you are good to go.


Seems reasonable, thanks @BramVanroy

Have you figured out how to increase GPU utilization?