Trainer with load_best_model_at_end doesn't work

I was trying to save a checkpoint after each epoch by setting `load_best_model_at_end=True` for parallel training on a p3.16xlarge. However, training fails with the following error:

```
OSError: Can't load config for 'bert_model/checkpoint-156'. Make sure that:
```

The error occurs only if the output_dir folder is empty; it does not occur if there are checkpoints left over from a previous run. Has anyone faced the same issue or have an idea about it?
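For what it's worth, here is a minimal, self-contained sketch of the idea behind `load_best_model_at_end` (the function name and arguments are invented for illustration; this is not the real Trainer code): the trainer remembers the path of the best checkpoint seen during evaluation and reloads it when training ends, so if nothing was ever written to output_dir the reload fails, while stale checkpoints from an earlier run can mask the problem.

```python
import os
import tempfile

def train_and_reload(output_dir, eval_losses, save_checkpoints=True):
    """Simulate epochs: track the best checkpoint, then reload it at the end."""
    best_loss, best_ckpt = float("inf"), None
    for step, loss in enumerate(eval_losses, start=1):
        ckpt = os.path.join(output_dir, f"checkpoint-{step}")
        if save_checkpoints:
            os.makedirs(ckpt, exist_ok=True)
            # a real Trainer would write config.json, weights, etc. here
            open(os.path.join(ckpt, "config.json"), "w").close()
        if loss < best_loss:
            best_loss, best_ckpt = loss, ckpt
    # the "load best model at end" step: read back the remembered checkpoint
    config_path = os.path.join(best_ckpt, "config.json")
    if not os.path.isfile(config_path):
        raise OSError(f"Can't load config for '{best_ckpt}'")
    return best_ckpt

with tempfile.TemporaryDirectory() as d:
    # empty output_dir and no checkpoint ever saved -> the reload fails
    try:
        train_and_reload(d, [0.9], save_checkpoints=False)
    except OSError as err:
        print("empty output_dir:", err)
    # leftover checkpoints from an earlier run mask the problem
    train_and_reload(d, [0.9, 0.5, 0.7])
    print(train_and_reload(d, [0.9], save_checkpoints=False))
```

This toy version reproduces both observations: a fresh output_dir raises the `OSError`, while a directory still holding checkpoints from a previous run "succeeds" by loading a stale checkpoint.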

Which version of Transformers are you using? Also, what were the training arguments for this run?

The Transformers version is 4.3.3. The training arguments are:

```
--train_data_file sample_data.json \
--test_data_file sample_data.json \
--output_dir output \
--tokenizer_name bert-base-uncased \
--model_name_or_path bert-base-uncased \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--do_train \
--do_predict \
--num_train_epochs 3 \
--learning_rate 0.00001 \
--logging_steps 100 \
--dataloader_num_workers 4 \
--evaluation_strategy epoch \
--overwrite_output_dir \
--load_best_model_at_end True \
--logging_dir logs
```

It’s probably a bug that has been fixed in the latest version, could you try again with v4.8.2?

I tried it with v4.8.2, but there was another error:

```
File "/home/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/trainer.py", line 594, in _get_train_sampler
    seed=self.args.seed,
TypeError: __init__() got an unexpected keyword argument 'seed'
subprocess.CalledProcessError: Command '['/home/anaconda3/envs/pytorch_p36/bin/python', '-u', 'main.py', '--local_rank=7', '--train_file', '…/sample_data.json', '--validation_file', '…/sample_data.json', '--output_dir', 'output_roberta', '--model_name_or_path', '…/roberta_base', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '16', '--do_train', '--num_train_epochs', '1', '--learning_rate', '2e-5', '--weight_decay', '1e-4', '--max_seq_length', '128', '--logging_steps', '100', '--load_best_model_at_end', 'True', '--dataloader_num_workers', '4', '--evaluation_strategy', 'epoch', '--overwrite_output_dir', '--logging_dir', 'logs']' returned non-zero exit status 1.
```
My `Trainer` looks like:

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator.collate_batch,
)
```
Do you know how to fix it?

Can you give the whole stack trace and not just the end? It's hard to see where your issue comes from otherwise. Also, please put it between two ``` lines to format it properly:
```
stack_trace
```