Howdy!
I was training my model with:
trainer = Trainer(model=model, args=training_args, train_dataset=train_data, eval_dataset=val_data, callbacks=[SavePeftModelCallback, LoadBestPeftModelCallback])
It failed with torch.cuda.OutOfMemoryError: CUDA out of memory.
right before saving the model. But I have the following structure:
checkpoints/
└── checkpoint-1000
    ├── adapter_config.json
    ├── adapter_model.bin
    ├── pytorch_model.bin
    ├── rng_state_1.pth
    ├── rng_state_2.pth
    └── rng_state_3.pth
1 directory, 6 files
Did I lose everything, or can I recover the trained model from checkpoint-1000? If so, how?