Checkpoint-500 not being generated for LLaMA-7B fine-tuning

Training script:

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=MICRO_BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        warmup_steps=100,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=True,
        logging_steps=1,
        output_dir="lora-alpaca",
        save_total_limit=3,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False
trainer.train(resume_from_checkpoint=False)

model.save_pretrained("lora-alpaca")

Issue:

1. In the first few runs, the following directories and files are generated:
    a) checkpoint-500
        i. optimizer.pt
        ii. pytorch_model.bin
        iii. rng_state.pth
        iv. scaler.pt
        v. scheduler.pt
        vi. trainer_state.json
        vii. training_args.json
    b) runs
        i. one directory per run
    c) adapter_config.json
    d) adapter_model.json
    e) README.md
2. Later, with the same script but different data, only the following are generated:
    a) adapter_config.json
    b) adapter_model.json
    c) README.md

Why is checkpoint-500 not being generated?

PS: Is this related to save_total_limit?
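For what it's worth, my current guess (an assumption, not verified) is that with the default save_strategy="steps" the Trainer only writes a checkpoint every save_steps optimizer steps (default 500), so a smaller dataset may finish training before step 500 is ever reached. A rough back-of-the-envelope sketch in plain Python (the function and variable names here are hypothetical, just to illustrate the arithmetic):

```python
import math

def total_optimizer_steps(num_examples, micro_batch_size,
                          grad_accum_steps, epochs, num_devices=1):
    """Approximate the number of optimizer steps the Trainer will take."""
    effective_batch = micro_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

SAVE_STEPS = 500  # the Trainer's default save interval with save_strategy="steps"

# A large dataset crosses step 500, so checkpoint-500 would appear...
print(total_optimizer_steps(100_000, 4, 8, 3) >= SAVE_STEPS)  # True
# ...but a small dataset finishes first, so no checkpoint is ever written.
print(total_optimizer_steps(2_000, 4, 8, 3) >= SAVE_STEPS)    # False
```

If that guess is right, lowering save_steps (or switching to save_strategy="epoch") should bring checkpoints back for the smaller dataset, and save_total_limit would only control how many checkpoints are kept, not whether they are created.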

@sgugger @BramVanroy @shainaraza @SUNM
Please help me out!