I am using the Hugging Face Trainer for training. My training arguments are as follows:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bigbird-nq-output-dir",
    overwrite_output_dir=False,
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=3,
    logging_strategy="epoch",
    save_strategy="steps",
    run_name="bigbird-nq",
    disable_tqdm=False,
    load_best_model_at_end=True,
    report_to="wandb",
    remove_unused_columns=False,
    fp16=True,
)
I am unable to find the checkpoints that should be saved every 500 steps. Any idea why?
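
One thing that may be worth checking is whether the run ever reaches 500 optimizer steps at all. With save_strategy="steps", a step is one optimizer update, so gradient_accumulation_steps=4 makes each step consume four batches. The sketch below is only a rough estimate of the step count, not the exact Trainer logic. The function name estimate_optimizer_steps, the single-device assumption, and the dataset size of 1000 examples are all hypothetical, since the real dataset size is not given here.

```python
import math


def estimate_optimizer_steps(num_examples, per_device_batch_size,
                             grad_accum_steps, num_epochs, num_devices=1):
    """Rough estimate of total optimizer updates (hypothetical helper)."""
    # Number of batches the dataloader yields per epoch.
    batches_per_epoch = math.ceil(
        num_examples / (per_device_batch_size * num_devices)
    )
    # One optimizer update happens every grad_accum_steps batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)


# Values taken from the TrainingArguments above, except num_examples,
# which is a made-up placeholder for the real training set size.
total_steps = estimate_optimizer_steps(
    num_examples=1000,
    per_device_batch_size=2,
    grad_accum_steps=4,
    num_epochs=3,
)
save_every = 500  # assumed save interval, since save_steps is not set above

print(f"estimated optimizer steps: {total_steps}")
print(f"checkpoints expected: {total_steps // save_every}")
```

If the estimate for the real dataset comes out below the save interval, no step-based checkpoint would be written during training, which would match what I am seeing.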