Let’s say I am fine-tuning a model, an error is encountered partway through, and training stops. Let’s also say that, using Trainer, I have it configured to save checkpoints along the way. How would I go about loading the model from the last checkpoint saved before the error?
For reference, here is the configuration of my Trainer object:
TRAINER ARGS
args: TrainingArguments(
    output_dir='models/textgen/out',
    overwrite_output_dir=False,
    do_train='True',
    do_eval=False,
    do_predict=False,
    evaluate_during_training=False,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    per_gpu_train_batch_size=None,
    per_gpu_eval_batch_size=None,
    gradient_accumulation_steps=1,
    learning_rate=5e-05,
    weight_decay=0.0,
    adam_epsilon=1e-08,
    max_grad_norm=1.0,
    num_train_epochs=3.0,
    max_steps=-1,
    warmup_steps=0,
    logging_dir='models/textgen/logs',
    logging_first_step=False,
    logging_steps=500,
    save_steps=500,
    save_total_limit=None,
    no_cuda=False,
    seed=42,
    fp16=False,
    fp16_opt_level='O1',
    local_rank=-1,
    tpu_num_cores=None,
    tpu_metrics_debug=False,
    debug=False,
    dataloader_drop_last=False,
    eval_steps=1000,
    past_index=-1)
data_collator: <function sd_data_collator at 0x7ffaba8f8e18>
train_dataset: <custom_dataset.SDAbstractsDataset object at 0x7ffa18c8c400>
eval_dataset: None
compute_metrics: None
prediction_loss_only: False
optimizers: None
tb_writer: <torch.utils.tensorboard.writer.SummaryWriter object at 0x7ff9f79e45c0>
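In case it helps make the question concrete, here is roughly what I had in mind (just a sketch I have not verified). checkpoint-500 is a placeholder for whichever checkpoint-* directory was written last, AutoModelForCausalLM is a guess at the right class for my model, training_args and sd_train_dataset are stand-ins for the TrainingArguments and SDAbstractsDataset instances shown above, and I am not sure whether my version of transformers expects resume_from_checkpoint or the older model_path keyword on trainer.train(). Is this the right general approach, or is there a better way to pick up from the last checkpoint?

from transformers import AutoModelForCausalLM, Trainer

# "checkpoint-500" is a placeholder -- I would pick the highest-numbered
# checkpoint-* directory that exists under output_dir.
last_checkpoint = 'models/textgen/out/checkpoint-500'

# Reload the weights saved in that checkpoint directory.
# (AutoModelForCausalLM is a guess at the right class for my model.)
model = AutoModelForCausalLM.from_pretrained(last_checkpoint)

# Rebuild the Trainer the same way as before, reusing the same
# TrainingArguments, data collator, and dataset from my setup above.
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=sd_data_collator,
    train_dataset=sd_train_dataset,
)

# Resume training from the checkpoint; depending on the transformers
# version this keyword may instead be model_path, and it should also
# restore the optimizer/scheduler state saved alongside the weights.
trainer.train(resume_from_checkpoint=last_checkpoint)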