rlian
July 1, 2021, 3:42am
1
I was trying to save a checkpoint after each epoch by setting load_best_model_at_end=True with parallel training on a p3.16xlarge. However, the following error occurred:
OSError: Can't load config for 'bert_model/checkpoint-156'. Make sure that:
The error occurs only if the output_dir folder is empty; it does not occur if there were checkpoints left over from a previous training run. Has anyone faced the same issue or have an idea about this?
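In case it helps, this is roughly how I check whether the checkpoint folder the error points at actually exists and contains a config.json (a minimal sketch; the path is just the one from the error message above):

```python
from pathlib import Path

# Path copied from the error message above; adjust to your own output_dir.
ckpt = Path("bert_model/checkpoint-156")

# A checkpoint written by the Trainer should contain at least config.json plus
# the model weights; if the folder is missing or empty, from_pretrained raises
# the "Can't load config" OSError shown above.
print("exists:", ckpt.is_dir())
if ckpt.is_dir():
    print("files:", sorted(p.name for p in ckpt.iterdir()))
    print("has config.json:", (ckpt / "config.json").is_file())
```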
Which version of Transformers are you using? Also, what were the training arguments for this run?
rlian
July 1, 2021, 1:48pm
3
The Transformers version is 4.3.3. The training arguments are:
```
--train_data_file sample_data.json \
--test_data_file sample_data.json \
--output_dir output \
--tokenizer_name bert-base-uncased \
--model_name_or_path bert-base-uncased \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--do_train \
--do_predict \
--num_train_epochs 3 \
--learning_rate 0.00001 \
--logging_steps 100 \
--dataloader_num_workers 4 \
--evaluation_strategy epoch \
--overwrite_output_dir \
--load_best_model_at_end True \
--logging_dir logs
```
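For reference, the same flags expressed directly through TrainingArguments would look roughly like this (a sketch, not my exact script):

```python
from transformers import TrainingArguments

# Rough Python equivalent of the command-line flags above (sketch only).
training_args = TrainingArguments(
    output_dir="output",
    overwrite_output_dir=True,
    do_train=True,
    do_predict=True,
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_steps=100,
    logging_dir="logs",
    dataloader_num_workers=4,
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
)
```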
It's probably a bug that has been fixed in the latest version. Could you try again with v4.8.2?
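If the upgrade doesn't seem to take effect, you can double-check which version is actually being imported in the training environment with:

```python
import transformers

# Should print 4.8.2 (or newer) once the upgrade is picked up by this environment.
print(transformers.__version__)
```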
rlian
July 1, 2021, 4:39pm
5
I tried it with v4.8.2 but there was another error:
```
File "/home/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/trainer.py", line 594, in _get_train_sampler
    seed=self.args.seed,
TypeError: __init__() got an unexpected keyword argument 'seed'

subprocess.CalledProcessError: Command '['/home/anaconda3/envs/pytorch_p36/bin/python', '-u', 'main.py', '--local_rank=7', '--train_file', '…/sample_data.json', '--validation_file', '…/sample_data.json', '--output_dir', 'output_roberta', '--model_name_or_path', '…/roberta_base', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '16', '--do_train', '--num_train_epochs', '1', '--learning_rate', '2e-5', '--weight_decay', '1e-4', '--max_seq_length', '128', '--logging_steps', '100', '--load_best_model_at_end', 'True', '--dataloader_num_workers', '4', '--evaluation_strategy', 'epoch', '--overwrite_output_dir', '--logging_dir', 'logs']' returned non-zero exit status 1.
```
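My guess is that the sampler being built in _get_train_sampler does not accept a seed argument in this environment. One way to check which signature is actually installed (assuming the sampler in question is torch's DistributedSampler) is:

```python
import inspect

import torch
from torch.utils.data.distributed import DistributedSampler

# The `seed` kwarg was only added to DistributedSampler in newer PyTorch
# releases, so an older torch install would raise the TypeError above when
# transformers 4.8.x passes seed=... in _get_train_sampler.
print("torch version:", torch.__version__)
print("DistributedSampler signature:", inspect.signature(DistributedSampler.__init__))
```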
My Trainer looks like:
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator.collate_batch,
)
```
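(For what it's worth, on 4.x the library's own data collators are plain callables, so the current examples pass the collator object itself rather than a .collate_batch method. A minimal sketch assuming DataCollatorWithPadding, which may differ from my custom collator:)

```python
from transformers import AutoTokenizer, DataCollatorWithPadding, Trainer

# Sketch only: built-in collators on transformers 4.x implement __call__,
# so the object itself is passed to Trainer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,                # model/args/datasets defined as in the snippet above
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
```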
Do you know how to fix it?
Can you give the whole stack trace and not just the end? It's hard to see where your issue comes from otherwise. Also, please put it between two ``` to format it properly:
```
stack_trace
```
mmukh
July 28, 2022, 7:57pm
7
Hi, were you able to fix this error?