Hello.
I am trying to train RoBERTa from scratch on several V100 GPUs. I already know that Hugging Face's transformers automatically detects multiple GPUs, and I confirmed this in the training log. But there is something I don't understand: I see no performance improvement between single-GPU and multi-GPU training. I ran three experiments, training the same model with the same batch size on a single GPU, on 2 GPUs, and on 4 GPUs. The code and a training-loss graph comparing the runs are below.
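As a side note, this is the kind of quick sanity check I mean when I say multiple GPUs are detected; it is only a minimal sketch and assumes CUDA_VISIBLE_DEVICES is not restricting the visible devices:

import torch
# Without a distributed launcher, the Trainer wraps the model in
# torch.nn.DataParallel across all visible GPUs, so this count is
# the number of devices it will train on.
print("visible GPUs:", torch.cuda.device_count())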
from transformers import RobertaConfig
config = RobertaConfig(
    num_hidden_layers=4,
    hidden_size=512,
    hidden_dropout_prob=0.1,
    num_attention_heads=8,
    attention_probs_dropout_prob=0.1,
    intermediate_size=2048,
    vocab_size=34492,
    type_vocab_size=1,
    initializer_range=0.02,
    max_position_embeddings=512,
    position_embedding_type="absolute",
)
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer", max_len=512)
from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config)
from transformers import LineByLineTextDataset
train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",
    block_size=tokenizer.max_len_single_sentence,
)
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
from transformers import Trainer, TrainingArguments
num_train_epochs = 4
max_steps = num_train_epochs * len(train_dataset)
warmup_steps = int(max_steps * 0.05)
training_args = TrainingArguments(
    output_dir="output",
    overwrite_output_dir=True,
    do_train=True,
    max_steps=max_steps,
    warmup_steps=warmup_steps,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=100,
    learning_rate=5e-5,
    weight_decay=0,
    max_grad_norm=1,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    # disable_tqdm=True,
    logging_dir="log",
    logging_first_step=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
The dark blue line is the 4-GPU run, the grey line is the 2-GPU run, and the sky blue line is the single-GPU run. As the number of GPUs increases, the number of steps (x-axis) gets much smaller. I understand that the shape of the loss curve is the same. What I don't understand is why training with multiple GPUs is slower than with a single GPU. If this is normal, how can I improve performance with multiple GPUs in this code? Is there an option I can tune?
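For reference, here is my understanding of the step counts as a minimal sketch; it assumes the Trainer's DataParallel mode multiplies the effective batch size by the number of GPUs, and num_examples below is just a placeholder, not my real dataset size:

import math

per_device_batch = 100        # same value as per_device_train_batch_size above
num_examples = 1_000_000      # placeholder for len(train_dataset)

for n_gpu in (1, 2, 4):
    effective_batch = per_device_batch * n_gpu            # samples consumed per optimizer step
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    print(n_gpu, effective_batch, steps_per_epoch)        # steps per epoch shrink as GPUs are added

If that arithmetic is right, each step simply processes n_gpu times more samples, which would explain the shorter x-axis, but it does not explain why the wall-clock time gets worse.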