I was trying to save a checkpoint after each epoch with load_best_model_at_end=True, running parallel training on a p3.16xlarge instance. However, the following error occurs:
OSError
Can't load config for 'bert_model/checkpoint-156'. Make sure that:
or 'bert_model/checkpoint-156' is the correct path to a directory containing config.json file.
The error occurs only when the output_dir folder is empty; it does not occur if checkpoints from a previous training run are already there. Has anyone faced the same issue, or does anyone have an idea what causes it?
Can you give the whole stack trace and not just the end? It's hard to see where your issue comes from otherwise. Also please put it between two ``` to format it properly:
```
stack_trace
```
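
While collecting the trace, it might also help to check which checkpoint folders actually contain a config.json at the moment the error is raised. Below is a rough sketch using only the standard library; the helper name `missing_configs` is hypothetical (not part of any library), and it assumes checkpoints are saved as `checkpoint-*` subfolders directly under output_dir, as in the path from the error message.

```python
import os

def missing_configs(output_dir):
    """Return checkpoint-* subfolders of output_dir that lack a config.json.

    Hypothetical diagnostic helper, not part of any library.
    """
    missing = []
    if not os.path.isdir(output_dir):
        # Nothing to inspect if the output folder does not exist yet.
        return missing
    for name in sorted(os.listdir(output_dir)):
        path = os.path.join(output_dir, name)
        if name.startswith("checkpoint-") and os.path.isdir(path):
            if not os.path.isfile(os.path.join(path, "config.json")):
                missing.append(path)
    return missing

if __name__ == "__main__":
    # "bert_model" is the output_dir taken from the error message above.
    for path in missing_configs("bert_model"):
        print("no config.json in", path)
```

If the folder named in the error shows up in this list (or does not exist at all), that would suggest the checkpoint was not fully written or was removed before it was loaded, which is worth mentioning alongside the stack trace.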