Hello.
I am trying to train RoBERTa from scratch on several V100 GPUs. I already know that Hugging Face's transformers automatically detects multiple GPUs, and I confirmed this in the training log. But there is something I don't understand: I see no performance improvement between single-GPU and multi-GPU training. I ran three experiments, training the same model with the same batch size on a single GPU, on 2 GPUs, and on 4 GPUs. The code and a training-loss graph comparing the runs are below.
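As a side note, this is the kind of quick sanity check I mean when I say multiple GPUs are detected; it is only a minimal sketch and assumes CUDA_VISIBLE_DEVICES is not restricting the visible devices:

import torch
# Without a distributed launcher, the Trainer wraps the model in
# torch.nn.DataParallel across all visible GPUs, so this count is
# the number of devices it will train on.
print("visible GPUs:", torch.cuda.device_count())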
from transformers import RobertaConfig
config = RobertaConfig(
    num_hidden_layers=4,
    hidden_size=512,
    hidden_dropout_prob=0.1,
    num_attention_heads=8,
    attention_probs_dropout_prob=0.1,
    intermediate_size=2048,
    vocab_size=34492,
    type_vocab_size=1,
    initializer_range=0.02,
    max_position_embeddings=512,
    position_embedding_type="absolute",
)
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer", max_len=512)
from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config)
from transformers import LineByLineTextDataset
train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",
    block_size=tokenizer.max_len_single_sentence,
)
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
from transformers import Trainer, TrainingArguments
num_train_epochs = 4
max_steps = num_train_epochs * len(train_dataset)
warmup_steps = int(max_steps * 0.05)
training_args = TrainingArguments(
    output_dir="output",
    overwrite_output_dir=True,
    do_train=True,
    max_steps=max_steps,
    warmup_steps=warmup_steps,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=100,
    learning_rate=5e-5,
    weight_decay=0,
    max_grad_norm=1,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    # disable_tqdm=True,
    logging_dir="log",
    logging_first_step=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
The dark blue line is the 4-GPU run, the grey line is the 2-GPU run, and the sky blue line is the single-GPU run. As the number of GPUs increases, the number of steps (x-axis) gets much smaller. I understand that the shape of the loss curve is the same. What I don't understand is why training with multiple GPUs is slower than with a single GPU. If this is normal, how can I improve performance with multiple GPUs in this code? Is there an option I can tune?
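For reference, here is my understanding of the step counts as a minimal sketch; it assumes the Trainer's DataParallel mode multiplies the effective batch size by the number of GPUs, and num_examples below is just a placeholder, not my real dataset size:

import math

per_device_batch = 100        # same value as per_device_train_batch_size above
num_examples = 1_000_000      # placeholder for len(train_dataset)

for n_gpu in (1, 2, 4):
    effective_batch = per_device_batch * n_gpu            # samples consumed per optimizer step
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    print(n_gpu, effective_batch, steps_per_epoch)        # steps per epoch shrink as GPUs are added

If that arithmetic is right, each step simply processes n_gpu times more samples, which would explain the shorter x-axis, but it does not explain why the wall-clock time gets worse.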