`flan-t5-xl` model does not appear to have a file named `pytorch_model.bin`

I am trying to fine-tune a `flan-t5-xl` model on Amazon SageMaker, using `run_summarization.py` as the training script.

This is my main script:

```python
from sagemaker.huggingface import HuggingFace

# role, output_location and code_location are defined earlier in my notebook
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.17.0'}

# hyperparameters, which are passed into the training job
hyperparameters = {'per_device_train_batch_size': 8,
                   'per_device_eval_batch_size': 8,
                   'model_name_or_path': 'google/flan-t5-xl',
                   'dataset_name': 'samsum',
                   'do_train': True,
                   'do_eval': True,
                   'do_predict': True,
                   'predict_with_generate': True,
                   'output_dir': f'{output_location}/model',
                   'num_train_epochs': 1,
                   'learning_rate': 5e-5,
                   'seed': 7,
                   'fp16': True,
                   'max_source_length': 1153,
                   'max_target_length': 95,
                   'source_prefix': 'summarize: '
                   }

# create the Estimator
huggingface_estimator = HuggingFace(
      entry_point='run_summarization.py',
      source_dir='./examples/pytorch/summarization',
      git_config=git_config,
      code_location=code_location,
      instance_type='ml.g4dn.xlarge',
      instance_count=1,
      transformers_version='4.17',
      pytorch_version='1.10',
      py_version='py38',
      role=role,
      hyperparameters=hyperparameters,
      output_path=output_location
)

# start the training job
huggingface_estimator.fit()
```

However, I get this error:

```
OSError: google/flan-t5-xl does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
```

It seems that the model checkpoint is split into two shards (`pytorch_model-00001-of-00002.bin` and `pytorch_model-00002-of-00002.bin`) because of the 10 GB file-size limit.
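For context on what "sharded" means here: the checkpoint is not one file split at byte boundaries. Each shard is a self-contained serialized state dict holding a subset of the weights, and an index file (`pytorch_model.bin.index.json` on the Hub) maps each weight name to the shard that contains it. A toy sketch of the idea, using plain `pickle` in place of torch serialization and made-up weight names:

```python
import json
import os
import pickle
import tempfile

# Build a toy "sharded checkpoint": each shard is an independent
# pickled dict of weights, plus an index mapping weight name -> shard file.
weights = {"encoder.w": [1.0, 2.0], "decoder.w": [3.0, 4.0]}
tmp = tempfile.mkdtemp()

weight_map = {}
for i, (name, tensor) in enumerate(sorted(weights.items()), start=1):
    fname = f"pytorch_model-{i:05d}-of-00002.bin"
    with open(os.path.join(tmp, fname), "wb") as f:
        pickle.dump({name: tensor}, f)  # each shard is a complete, standalone file
    weight_map[name] = fname

with open(os.path.join(tmp, "pytorch_model.bin.index.json"), "w") as f:
    json.dump({"weight_map": weight_map}, f)

# Loading walks the index and merges the per-shard dicts -- which is why
# simply concatenating the shard files would not produce a valid checkpoint.
with open(os.path.join(tmp, "pytorch_model.bin.index.json")) as f:
    index = json.load(f)

state_dict = {}
for name, fname in index["weight_map"].items():
    with open(os.path.join(tmp, fname), "rb") as f:
        state_dict.update(pickle.load(f))
```

This is only an illustration of the layout; real Hub shards are torch-serialized and a loader that understands the index (a recent `transformers`) is what reassembles them.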

How could I approach this problem?

I have thought about downloading and merging the model files into a single `pytorch_model.bin` file and then specifying the appropriate model path in `model_name_or_path`.
Would something like this work?

```shell
cat pytorch_model-00001-of-00002.bin pytorch_model-00002-of-00002.bin > pytorch_model.bin
```

Or perhaps I could download a single `pytorch_model.bin` file directly from somewhere?

The issue here is that you are using `transformers==4.17.0`, which does not support sharded models. To fix this, just create a `requirements.txt` in the `./examples/pytorch/summarization` directory if it doesn't exist yet, and add `transformers==4.25.1` to it.
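Concretely, the fix amounts to something like the following, run in the checkout that serves as your `source_dir` before calling `fit()` (SageMaker installs any `requirements.txt` found in the source directory at the start of the training job):

```shell
# Append the newer pin to the example's requirements file,
# creating the file if it does not exist yet.
mkdir -p ./examples/pytorch/summarization
echo "transformers==4.25.1" >> ./examples/pytorch/summarization/requirements.txt
```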


Thanks a lot for the answer @philschmid 🙂

That directory is located in the transformers repository from Hugging Face.
May I make a pull request to the `v4.25-release` branch and add `transformers==4.25.1` there?

No, this is not necessary, since the script assumes the correct version. We are working on updating the container version so that no change is needed.


Cool! 🙂

Is there an approximate date for the update to be completed?

Sometime in February. I'll let you know.
