The trainer args that are throwing an out-of-memory error:
training_args = TrainingArguments(
    output_dir="./vit-cifar10",
    per_device_train_batch_size=1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=4,
    fp16=True,
    logging_steps=10000,
    learning_rate=2e-4,
    save_total_limit=2,
    remove_unused_columns=False,
    push_to_hub=False,
    report_to="tensorboard",
    load_best_model_at_end=True,
)
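One difference I notice: this run never sets per_device_eval_batch_size or eval_accumulation_steps, so the epoch-end evaluation uses the defaults and, as I understand it, keeps all accumulated predictions on the GPU until the evaluation loop finishes. A minimal sketch with those two knobs made explicit (both are standard TrainingArguments parameters; the values are illustrative guesses, not tested):

from transformers import TrainingArguments

# Sketch: same epoch-strategy config, plus the two eval-memory knobs.
training_args = TrainingArguments(
    output_dir="./vit-cifar10",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,   # explicit; the default is also 8
    eval_accumulation_steps=32,     # move accumulated predictions to CPU every 32 eval steps
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=4,
    fp16=True,
    learning_rate=2e-4,
    save_total_limit=2,
    remove_unused_columns=False,
    load_best_model_at_end=True,
)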
The working trainer args:
training_args = TrainingArguments(
    output_dir="./vit-cifar10",
    per_device_train_batch_size=32,
    evaluation_strategy="steps",
    num_train_epochs=1,
    fp16=True,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=10,
    learning_rate=2e-4,
    save_total_limit=2,
    remove_unused_columns=False,
    push_to_hub=False,
    report_to="tensorboard",
    load_best_model_at_end=True,
)
I'm not sure why it runs out of memory with the epoch strategy. I'm using a Tesla P40 with 24 GB of memory.
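In case it helps with diagnosis, here is a minimal sketch of a callback that logs peak GPU memory after each evaluation pass (it assumes PyTorch's CUDA memory stats; the callback class itself is my own illustrative helper, not part of transformers):

import torch
from transformers import TrainerCallback

class MemoryLoggerCallback(TrainerCallback):
    # Illustrative helper: print peak CUDA memory after each evaluation.
    def on_evaluate(self, args, state, control, **kwargs):
        peak_gb = torch.cuda.max_memory_allocated() / 1024**3
        print(f"step {state.global_step}: peak CUDA memory {peak_gb:.2f} GB")
        torch.cuda.reset_peak_memory_stats()

Passing callbacks=[MemoryLoggerCallback()] to the Trainer should show whether the spike happens during the epoch-end evaluation rather than during training itself.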