EnvironmentError when calling push_to_hub()

Hello fine people of huggingface!

I’ve fine-tuned a BERT model on SageMaker, roughly following the outline from this notebook.

The training process itself completes without issues, but when the trainer.push_to_hub() method (this line from train.py) is called, I get the following error:

Error for Training job bert-target-sample-2022-05-02-08-35-04-2022-05-02-08-35-04-695: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise EnvironmentError(exc.stderr)
 OSError: remote: error: cannot lock ref 'refs/heads/main': is at bb0adcff5079411402a28f80eaa92ec8ae6ccbbd but expected 681a8a1fa9dd7b8bd61aecca1f5da47994f293e0         To https://huggingface.co/thusken/nb-bert-base-target-group  ! [remote rejected] main -> main (failed to update ref) error: failed to push some refs to 'https://user:{HF_TOKEN_REDACTED}@huggingface.co/thusken/nb-bert-base-target-group'"
Command "/opt/conda/bin/python3.8 train.py --epochs 1 --eval_batch_size 64 --hub_model_id nb-bert-base-target-group --hub_strategy every_save --hub_token {HF_TOKEN_REDACTED} --learning_rate 3e-05 --model_id NbAiLab/nb-bert-base --model_name_or_path NbAiLab/nb-bert-base --output_dir /opt/ml/model --push_to_hub True --train_batch_size 16", exit code: 1

Any idea what’s happening here?

This is probably because a push was already in progress (probably from your last save) and finished after the push requested by trainer.push_to_hub(). Could you try installing from source? That particular bug was fixed recently.

I’m not quite sure what you mean by installing from source in the context of SageMaker notebook instances. I’m using the following snippet to fit the model (where train.py refers to the script from my first post). If I’m not mistaken, this starts the job in a Hugging Face container inside a notebook instance, right? Where would I install transformers from source if I’m using a pre-built Hugging Face image?

hyperparameters = {
    'epochs': 1,                                    # number of training epochs
    'train_batch_size': 16,                         # batch size for training
    'eval_batch_size': 64,                          # batch size for evaluation
    'learning_rate': 3e-5,                          # learning rate
    'push_to_hub': True,                            # defines if we want to push the model to the hub
    'hub_model_id': 'nb-bert-base-target-group',    # the model id of the model to push to the hub
    'hub_strategy': 'every_save',                   # the strategy to use when pushing the model to the hub
    'hub_token': HfFolder.get_token()               # Hugging Face token with permission to push
    # add your remaining hyperparameters
    # more info here https://github.com/huggingface/transformers/tree/v4.17.0/examples/pytorch/text-classification
}

# git configuration to download our fine-tuning script

job_name = f'bert-target-sample-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# creates Hugging Face estimator
huggingface_estimator = sagemaker.huggingface.HuggingFace(
    entry_point          = 'train.py',        # fine-tuning script used in the training job
    source_dir           = './',              # directory where the fine-tuning script is stored
    instance_type        = 'ml.p3.2xlarge',   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version = '4.17',            # the transformers version used in the training job
    pytorch_version      = '1.10',            # the pytorch version used in the training job
    py_version           = 'py38',            # the python version used in the training job
    hyperparameters      = hyperparameters,   # the hyperparameters used for running the training job
)

# starting the training job with our uploaded datasets as input
huggingface_estimator.fit({
    'train': training_input_path,
    'test': test_input_path
})

Hey @thusken,

thanks for reporting this! We already have a PR open for releasing a new DLC with Transformers 4.18, which contains a fix for this: [huggingface_pytorch] Update Framework version to Pytorch 1.11 by philschmid · Pull Request #1824 · aws/deep-learning-containers · GitHub

In the meantime, there are two workarounds you could use:

  1. Add a time.sleep(100) before the line where you see the error; you can find an example here: huggingface-sagemaker-workshop-series/train.py at 345374941c55aa32f95d5993f7a7fc461e18f907 · philschmid/huggingface-sagemaker-workshop-series · GitHub
  2. Add a requirements.txt to your source_dir next to your train.py and include transformers==4.18.0. That way SageMaker will install transformers 4.18.0 before running your training.
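To make workaround 1 concrete, here is a minimal sketch of the idea: sleep long enough for the background push triggered by the last checkpoint save (under hub_strategy='every_save') to finish before issuing the final push, so the remote branch ref is no longer locked. The helper name push_with_delay is hypothetical; in the actual train.py you would simply place time.sleep(100) right before the trainer.push_to_hub() call.

```python
import time

def push_with_delay(trainer, delay_seconds=100):
    # Hypothetical helper illustrating workaround 1: wait out any in-flight
    # background push (e.g. the one started by the last checkpoint save)
    # before the final push, avoiding the "cannot lock ref" conflict.
    time.sleep(delay_seconds)
    return trainer.push_to_hub()
```

This is a blunt fix (a fixed delay rather than waiting on the actual push), which is why upgrading to transformers 4.18.0 via workaround 2 is the cleaner long-term option.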

Thanks for the input @philschmid!

Adding a time.sleep() call did the trick. I’ll look into bumping the transformers version to 4.18.0 next.
