Is my model saved?

seyyedaliayati · May 25, 2023, 10:16pm

Howdy!

I was training my model with:

trainer = Trainer(model=model, args=training_args, train_dataset=train_data, eval_dataset=val_data, callbacks=[SavePeftModelCallback, LoadBestPeftModelCallback])

It raised torch.cuda.OutOfMemoryError: CUDA out of memory. right before saving the model. But I have the following structure on disk:


checkpoints/
└── checkpoint-1000
    ├── adapter_config.json
    ├── adapter_model.bin
    ├── pytorch_model.bin
    ├── rng_state_1.pth
    ├── rng_state_2.pth
    └── rng_state_3.pth

1 directory, 6 files

Did I lose everything, or can I recover the trained model from checkpoint-1000? If so, how?
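
For anyone in the same spot, here is a minimal sketch of how the checkpoint could be checked and reloaded. It assumes adapter_config.json and adapter_model.bin were written by PEFT's save_pretrained, and that the base model can be loaded again separately. The helper names inspect_checkpoint and load_adapter are made up for illustration; only PeftModel.from_pretrained comes from the peft library.

```python
import json
from pathlib import Path


def inspect_checkpoint(ckpt_dir):
    """Report whether a checkpoint folder looks like a loadable PEFT adapter."""
    ckpt = Path(ckpt_dir)
    config_path = ckpt / "adapter_config.json"
    weights_path = ckpt / "adapter_model.bin"
    report = {
        "has_config": config_path.is_file(),
        # An empty weights file would suggest the save was interrupted.
        "has_weights": weights_path.is_file() and weights_path.stat().st_size > 0,
        "base_model": None,
    }
    if report["has_config"]:
        with config_path.open() as f:
            # PEFT records the base model name under this key.
            report["base_model"] = json.load(f).get("base_model_name_or_path")
    return report


def load_adapter(ckpt_dir, base_model):
    """Attach the saved adapter weights to an already loaded base model.

    Hypothetical helper: base_model is assumed to be the same architecture
    the adapter was trained on.
    """
    from peft import PeftModel  # imported lazily; requires the peft package

    return PeftModel.from_pretrained(base_model, str(ckpt_dir))
```

Calling inspect_checkpoint("checkpoints/checkpoint-1000") should show whether both adapter files are present and which base model the config points to. If both are there, load_adapter(...) with a freshly loaded base model would be the next thing to try. Whether the weights reflect step 1000 fully depends on whether the save finished before the OOM, which this sketch cannot confirm.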
