Training script:
import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=MICRO_BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        warmup_steps=100,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=True,
        logging_steps=1,
        output_dir="lora-alpaca",
        save_total_limit=3,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # disable the attention KV cache during training
trainer.train(resume_from_checkpoint=False)
model.save_pretrained("lora-alpaca")  # on a PEFT model this writes only the adapter files
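Note that nothing above sets the save cadence, so the TrainingArguments defaults apply: save_strategy="steps" with save_steps=500, i.e. a checkpoint-<global_step> directory is written every 500 optimizer steps. A minimal sketch spelling those defaults out explicitly (the values shown are just the library defaults, for illustration):

args=transformers.TrainingArguments(
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    # ... other arguments as above ...
    save_strategy="steps",   # checkpoint on a step interval (library default)
    save_steps=500,          # -> checkpoint-500, checkpoint-1000, ... (library default)
    save_total_limit=3,      # keep only the 3 most recent checkpoint-* dirs
),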
Issue:
1. In the first few runs, the following directories and files are generated:
a) checkpoint-500
i. optimizer.pt
ii. pytorch_model.bin
iii. rng_state.pth
iv. scaler.pt
v. scheduler.pt
vi. trainer_state.json
vii. training_args.bin
b) runs
i. one subdirectory per training run (TensorBoard event logs)
c) adapter_config.json
d) adapter_model.bin
e) README.md
2. Later, with the same script but a different dataset, only the following are generated:
a) adapter_config.json
b) adapter_model.bin
c) README.md
Why is checkpoint-500 no longer being generated?
PS: Is this related to save_total_limit?
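For context on what drives that directory's appearance: the Trainer writes checkpoint-500 only once the global optimizer step counter reaches 500, and save_total_limit only prunes already-written checkpoint-* directories; it never prevents one from being written. A back-of-the-envelope sketch with hypothetical numbers (substitute your real dataset size and hyperparameters):

# Hypothetical values, for illustration only
num_examples = 2000        # rows in the new train split
micro_batch_size = 4       # per_device_train_batch_size
grad_accum = 8             # gradient_accumulation_steps
epochs = 3                 # num_train_epochs

effective_batch = micro_batch_size * grad_accum      # examples consumed per optimizer step
steps_per_epoch = num_examples // effective_batch    # 2000 // 32 = 62
total_steps = steps_per_epoch * epochs               # 62 * 3 = 186

# 186 < 500, so training finishes before the first save point
# and no checkpoint-500 directory is ever created.
print(total_steps)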