Hi,
I’m getting an error when trying to push my fine-tuned model to the Hub. Following the example notebook on Spot instances in SageMaker, I ran the training with these settings:
- In the notebook:
hyperparameters={
    ...
    'output_dir': '/opt/ml/checkpoints'
}
- Inside train.py:
training_args = TrainingArguments(
    output_dir=args.output_dir,
    # Overwrite the output dir only when a previous checkpoint exists to resume from
    overwrite_output_dir=get_last_checkpoint(args.output_dir) is not None,
    ...
)
# Fine-tune the model
last_checkpoint = get_last_checkpoint(args.output_dir)
if last_checkpoint is not None:
    logger.info("***** continue training *****")
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    logger.info("***** start training *****")
    trainer.train()
...
# Push to the Hugging Face Hub
kwargs = {
    ...
}
trainer.push_to_hub(**kwargs)
From the logs I can see that "model_dir" is set to "/opt/ml/model".
However, when pushing the model to the Hub, I get this error:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/checkpoints/model.safetensors.sagemaker-uploading'
I can see these files in my bucket after the job has completed:
config.json
model.safetensors
preprocessor_config.json
README.md
training_args.bin
It looks like '.sagemaker-uploading' is appended to the file name at some point. I don’t know what causes this behavior or how to disable it.
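One possible workaround is sketched below. It assumes the suffix comes from SageMaker temporarily renaming files in /opt/ml/checkpoints while it syncs them to S3, which is not confirmed here. The idea is to copy the final model files into a separate directory, skipping anything that carries the suffix, and push from there. The helper name stage_for_push and the staging path are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List

# Suffix seen in the error message; assumed to mark files that are mid-upload.
UPLOADING_SUFFIX = ".sagemaker-uploading"


def stage_for_push(checkpoint_dir: str, staging_dir: str) -> List[str]:
    """Copy top-level files from checkpoint_dir into staging_dir.

    Subdirectories (e.g. intermediate checkpoint folders) and files whose
    names end with UPLOADING_SUFFIX are skipped. Returns the copied names.
    """
    src = Path(checkpoint_dir)
    dst = Path(staging_dir)
    dst.mkdir(parents=True, exist_ok=True)

    copied = []
    for path in sorted(src.iterdir()):
        if not path.is_file() or path.name.endswith(UPLOADING_SUFFIX):
            continue
        shutil.copy2(path, dst / path.name)
        copied.append(path.name)
    return copied
```

In train.py this could be called after training, for example stage_for_push(args.output_dir, "/tmp/hub_staging"), and the staged folder could then be uploaded with huggingface_hub's HfApi().upload_folder instead of trainer.push_to_hub. A file could still be renamed while it is being copied, so this reduces the chance of the error rather than ruling it out.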