I am using the code below:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir=f"./out_fold{i}",
    overwrite_output_dir=True,
    evaluation_strategy="steps",
    eval_steps=40,
    logging_steps=40,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    seed=0,
    save_total_limit=1,
    # report_to="none",
    # logging_steps='epoch',
    load_best_model_at_end=True,
    save_strategy="no",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # compute_metrics=compute_metrics,
    # callbacks=[EarlyStoppingCallback(early_stopping_pa)],
)
trainer.train()
trainer.save_model(f'out_fold{i}')
Here, even though save_strategy = "no", checkpoints are still being written to disk at every evaluation step (log excerpt below), and the disk eventually fills up. Can you suggest what's going wrong?
***** Running Evaluation *****
Num examples = 567
Batch size = 8
Saving model checkpoint to ./out_fold0/checkpoint-40
Configuration saved in ./out_fold0/checkpoint-40/config.json
Model weights saved in ./out_fold0/checkpoint-40/pytorch_model.bin
Deleting older checkpoint [out_fold0/checkpoint-760] due to args.save_total_limit
***** Running Evaluation *****
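For reference, this is roughly the setup I think I want instead (a sketch only, with the values carried over from above plus a few assumptions of mine, such as selecting the best model on eval_loss; I have not verified that it avoids the disk issue): save checkpoints at the same "steps" cadence as evaluation so that load_best_model_at_end has something to restore, and rely on save_total_limit to keep disk usage bounded.

from transformers import TrainingArguments

# Sketch: checkpointing cadence matches the evaluation cadence, and
# save_total_limit bounds how many checkpoints stay on disk.
# metric_for_best_model / greater_is_better are assumptions on my side.
args = TrainingArguments(
    output_dir=f"./out_fold{i}",
    overwrite_output_dir=True,
    evaluation_strategy="steps",
    eval_steps=40,
    logging_steps=40,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    seed=0,
    save_strategy="steps",              # match evaluation_strategy
    save_steps=40,                      # match eval_steps
    save_total_limit=1,                 # keep only a small number of checkpoints
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # assumption: select best model on eval loss
    greater_is_better=False,            # lower eval_loss is better
)

If I read the docs correctly, with load_best_model_at_end=True the best checkpoint is retained in addition to the most recent one, so the worst case here should be two checkpoints on disk rather than one per evaluation.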