I’m trying to train RoBERTa from scratch on a proprietary dataset using the script from the HF repo (https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py). When I run training on a machine with 32 cores and 8x V100, the GPUs are not utilized at 100% all the time, and it looks like there is a bottleneck in the transfer between CPU and GPU. Even when I set the number of workers in the DataLoaders to 32, throughput does not change at all.
My batch size is 8 (the maximum I could fit into a 16 GB V100 on Google Cloud), and all examples have 512 tokens.
How can I improve GPU usage? Are there any additional parameters that need to be configured to utilize the GPUs better?
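To make the question concrete, below is a minimal PyTorch sketch (outside the HF Trainer, with a random tensor standing in for my pre-tokenized data) of the input-pipeline settings I’m asking about: DataLoader worker processes, pinned host memory, and non-blocking host-to-device copies.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random token ids stand in for the real pre-tokenized dataset (512 tokens per example).
fake_ids = torch.randint(0, 32768, (10_000, 512), dtype=torch.long)
dataset = TensorDataset(fake_ids)

loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=4,     # background worker processes feeding batches; tune for the CPU count
    pin_memory=True,   # page-locked host memory allows faster, overlappable CPU-to-GPU copies
    drop_last=True,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for (input_ids,) in loader:
    # non_blocking only takes effect when the source tensor lives in pinned memory
    input_ids = input_ids.to(device, non_blocking=True)
    # ... forward/backward pass would go here
    break
```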
I’ve also captured a few seconds of watch nvidia-smi output to give you a full picture:
https://1drv.ms/v/s!AkfjsmHCRwTChtVDsCiowniO7rMLSA?e=zjNAkF
I’m using the following model parameters:
```json
{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 32768
}
```
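In case it matters, the config can be loaded on its own to sanity-check the resulting model size. A minimal sketch, assuming the JSON above is saved as config.json:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Build a randomly initialized model from the config above (training from scratch).
config = RobertaConfig.from_json_file("config.json")
model = RobertaForMaskedLM(config)

print(config.max_position_embeddings)            # 514
print(f"{model.num_parameters():,} parameters")  # rough size check before training
```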