Trainer with load_best_model_at_end doesn't work

I was trying to save a checkpoint after each epoch by setting `load_best_model_at_end=True` for parallel training on a p3.16xlarge. However, training fails with the following error:

```
OSError: Can't load config for 'bert_model/checkpoint-156'. Make sure that:
```

The error occurs only if the output_dir folder is empty; it does not occur if there are checkpoints left over from a previous run. Has anyone faced the same issue or have an idea about it?
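For what it's worth, here is a minimal, self-contained sketch of the idea behind `load_best_model_at_end` (the function name and arguments are invented for illustration; this is not the real Trainer code): the trainer remembers the path of the best checkpoint seen during evaluation and reloads it when training ends, so if nothing was ever written to output_dir the reload fails, while stale checkpoints from an earlier run can mask the problem.

```python
import os
import tempfile

def train_and_reload(output_dir, eval_losses, save_checkpoints=True):
    """Simulate epochs: track the best checkpoint, then reload it at the end."""
    best_loss, best_ckpt = float("inf"), None
    for step, loss in enumerate(eval_losses, start=1):
        ckpt = os.path.join(output_dir, f"checkpoint-{step}")
        if save_checkpoints:
            os.makedirs(ckpt, exist_ok=True)
            # a real Trainer would write config.json, weights, etc. here
            open(os.path.join(ckpt, "config.json"), "w").close()
        if loss < best_loss:
            best_loss, best_ckpt = loss, ckpt
    # the "load best model at end" step: read back the remembered checkpoint
    config_path = os.path.join(best_ckpt, "config.json")
    if not os.path.isfile(config_path):
        raise OSError(f"Can't load config for '{best_ckpt}'")
    return best_ckpt

with tempfile.TemporaryDirectory() as d:
    # empty output_dir and no checkpoint ever saved -> the reload fails
    try:
        train_and_reload(d, [0.9], save_checkpoints=False)
    except OSError as err:
        print("empty output_dir:", err)
    # leftover checkpoints from an earlier run mask the problem
    train_and_reload(d, [0.9, 0.5, 0.7])
    print(train_and_reload(d, [0.9], save_checkpoints=False))
```

This toy version reproduces both observations: a fresh output_dir raises the `OSError`, while a directory still holding checkpoints from a previous run "succeeds" by loading a stale checkpoint.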

Which version of Transformers are you using? Also, what were the training arguments for this run?

The Transformers version is 4.3.3. The training arguments are:

```
--train_data_file sample_data.json \
--test_data_file sample_data.json \
--output_dir output \
--tokenizer_name bert-base-uncased \
--model_name_or_path bert-base-uncased \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--do_train \
--do_predict \
--num_train_epochs 3 \
--learning_rate 0.00001 \
--logging_steps 100 \
--dataloader_num_workers 4 \
--evaluation_strategy epoch \
--overwrite_output_dir \
--load_best_model_at_end True \
--logging_dir logs
```

It’s probably a bug that has been fixed in the latest version, could you try again with v4.8.2?

I tried it with v4.8.2, but there was another error:

```
File "/home/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/trainer.py", line 594, in _get_train_sampler
    seed=self.args.seed,
TypeError: __init__() got an unexpected keyword argument 'seed'
subprocess.CalledProcessError: Command '['/home/anaconda3/envs/pytorch_p36/bin/python', '-u', 'main.py', '--local_rank=7', '--train_file', '…/sample_data.json', '--validation_file', '…/sample_data.json', '--output_dir', 'output_roberta', '--model_name_or_path', '…/roberta_base', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '16', '--do_train', '--num_train_epochs', '1', '--learning_rate', '2e-5', '--weight_decay', '1e-4', '--max_seq_length', '128', '--logging_steps', '100', '--load_best_model_at_end', 'True', '--dataloader_num_workers', '4', '--evaluation_strategy', 'epoch', '--overwrite_output_dir', '--logging_dir', 'logs']' returned non-zero exit status 1.
```
My `Trainer` looks like:

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator.collate_batch,
)
```
Do you know how to fix it?

Can you give the whole stack trace and not just the end? It's hard to see where your issue comes from otherwise. Also, please put it between two ``` lines to format it properly:
```
stack_trace
```