Hi all,
I ran into a very similar issue while trying to train a RoBERTa model for masked language modeling (MLM) with the Trainer's train function. The setup is the following:
from transformers import RobertaConfig, RobertaForMaskedLM, Trainer, TrainingArguments, EvaluationStrategy

roberta_config = RobertaConfig(
    vocab_size=10000,
    max_position_embeddings=256,
    num_attention_heads=6,
    num_hidden_layers=3,
    type_vocab_size=1,
)
roberta = RobertaForMaskedLM(config=roberta_config)
training_args = TrainingArguments(
    output_dir=output_path,
    logging_dir=logging_path,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=256,
    save_steps=1_000,
    save_total_limit=2,
    prediction_loss_only=False,
    evaluation_strategy=EvaluationStrategy.EPOCH,
    do_train=True,
    do_eval=True,
    evaluate_during_training=True,
    logging_steps=10,
)
trainer = Trainer(
    model=roberta,
    args=training_args,
    data_collator=data_collator,
    train_dataset=smiles_training_dataset,
    eval_dataset=smiles_eval_dataset,
)
trainer.train()
The error I get looks the same:
File "/lib/python3.8/site-packages/transformers/trainer.py", line 775, in train
tr_loss += self.training_step(model, inputs)
File "/home/matthias/anaconda3/envs/chemtran/lib/python3.8/site-packages/transformers/trainer.py", line 1126, in training_step
loss.backward()
File "/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 762.00 MiB (GPU 0; 7.93 GiB total capacity; 6.15 GiB already allocated; 340.06 MiB free; 6.94 GiB reserved in total by PyTorch)
It doesn’t appear immediately though, but rather non-deterministically, a fair way into training, which points to a memory leak somewhere. Would you have any tips or ideas on how to approach this?
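For what it's worth, here is roughly what I plan to try next to narrow it down: a small callback that prints CUDA memory statistics every time the Trainer logs, so I can see whether usage creeps up over the steps. This is only a sketch; the callback name is mine, and it assumes a transformers version where the Trainer accepts callbacks.

import torch
from transformers import TrainerCallback

class GpuMemoryCallback(TrainerCallback):
    # Print allocated/reserved CUDA memory at every logging step to spot a slow upward creep.
    def on_log(self, args, state, control, logs=None, **kwargs):
        if torch.cuda.is_available():
            allocated = torch.cuda.memory_allocated() / 1024 ** 2
            reserved = torch.cuda.memory_reserved() / 1024 ** 2
            print(f"step {state.global_step}: allocated {allocated:.0f} MiB, reserved {reserved:.0f} MiB")

trainer = Trainer(
    model=roberta,
    args=training_args,
    data_collator=data_collator,
    train_dataset=smiles_training_dataset,
    eval_dataset=smiles_eval_dataset,
    callbacks=[GpuMemoryCallback()],
)

If it turns out to be plain memory pressure rather than a leak (occasional long batches can spike the activation memory), lowering per_device_train_batch_size (e.g. to 64) and setting gradient_accumulation_steps=4 would keep the effective batch size at 256 while reducing the peak per-step memory.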