🤗Trainer not saving after save_steps

I am using 🤗 Trainer for training. My training args are as follows:

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="bigbird-nq-output-dir",
        overwrite_output_dir=False,
        do_train=True,
        do_eval=True,
        evaluation_strategy="epoch",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-5,
        num_train_epochs=3,
        logging_strategy="epoch",
        save_strategy="steps",
        run_name="bigbird-nq",
        disable_tqdm=False,
        load_best_model_at_end=True,
        report_to="wandb",
        remove_unused_columns=False,
        fp16=True,
    )

I can't find any checkpoints saved every 500 steps. Any idea why?

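With load_best_model_at_end=True, your save_strategy is ignored and falls back to evaluation_strategy, because the Trainer needs a checkpoint at every evaluation in order to reload the best one at the end. Since your evaluation_strategy is "epoch", you will find one checkpoint at the end of each epoch.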
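
If you do want a checkpoint every 500 steps while keeping load_best_model_at_end=True, one option is to evaluate on the same step schedule. Below is a minimal sketch of the relevant keyword arguments, not a tested recipe for your setup: the value 500 for eval_steps and save_steps is an assumption based on your question, and the dict name is only for illustration. As far as I know, newer transformers releases rename evaluation_strategy to eval_strategy, so adjust the key for your version.

```python
# Keyword arguments meant to be passed as TrainingArguments(**step_checkpoint_kwargs).
# Only the settings that affect checkpointing are shown; the rest of the
# original arguments (batch sizes, learning rate, fp16, ...) stay unchanged.
step_checkpoint_kwargs = dict(
    output_dir="bigbird-nq-output-dir",
    evaluation_strategy="steps",  # evaluate on the same schedule as saving
    eval_steps=500,               # assumed interval, taken from the question
    save_strategy="steps",
    save_steps=500,               # kept equal to eval_steps
    load_best_model_at_end=True,
)

# Usage (requires transformers to be installed):
# from transformers import TrainingArguments
# args = TrainingArguments(**step_checkpoint_kwargs)
```

The trade-off is that evaluation now also runs every 500 steps, which can be slow on a large eval set. If that is too expensive, the alternative is to set load_best_model_at_end=False so that save_strategy="steps" is honored on its own.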

Got it. Thanks a lot!