`flan-t5-xl` model does not appear to have a file named `pytorch_model.bin`

I am trying to fine-tune a flan-t5-xl model using run_summarization.py as the training script on Amazon SageMaker.

This is my main script:

from sagemaker.huggingface import HuggingFace

git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.17.0'}

# hyperparameters, which are passed into the training job
hyperparameters={'per_device_train_batch_size': 8,
                 'per_device_eval_batch_size': 8,
                 'model_name_or_path': 'google/flan-t5-xl',
                 'dataset_name': 'samsum',
                 'do_train': True,
                 'do_eval': True,
                 'do_predict': True,
                 'predict_with_generate': True,
                 'output_dir': f'{output_location}/model',
                 'num_train_epochs': 1,
                 'learning_rate': 5e-5,
                 'seed': 7,
                 'fp16': True,
                 'max_source_length': 1153,
                 'max_target_length': 95,
                 'source_prefix': 'summarize: '
                 }

# create the Estimator
huggingface_estimator = HuggingFace(
      entry_point='run_summarization.py',
      source_dir='./examples/pytorch/summarization',
      git_config=git_config,
      code_location=code_location,
      instance_type='ml.g4dn.xlarge',
      instance_count=1,
      transformers_version='4.17',
      pytorch_version='1.10',
      py_version='py38',
      role=role,
      hyperparameters=hyperparameters,
      output_path=output_location
)

# starting the train job
huggingface_estimator.fit()

However, I get this error:

OSError: google/flan-t5-xl does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

It seems that the model is split into 2 shards because of the 10 GB file size limit (pytorch_model-00001-of-00002.bin and pytorch_model-00002-of-00002.bin).

How could I approach this problem?

I have thought about downloading the model files, merging them into a single pytorch_model.bin file, and then specifying the appropriate model path in 'model_name_or_path'.
Would something like this work?

cat pytorch_model-00001-of-00002.bin pytorch_model-00002-of-00002.bin > pytorch_model.bin

Or perhaps I can download the pytorch_model.bin file directly from somewhere?

The issue is that you are using transformers==4.17.0, which does not support sharded models. To fix this, create a requirements.txt in the ./examples/pytorch/summarization directory (if it doesn't exist yet) and add transformers==4.25.1 to it.
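
A minimal sketch of the requirements.txt step described above, assuming a local copy of the examples directory is used as `source_dir`. The helper name `ensure_requirement` is hypothetical and is not part of transformers or the SageMaker SDK. It creates the file if it is missing and adds the pin only once.

```python
from pathlib import Path


def ensure_requirement(source_dir: str, requirement: str) -> Path:
    """Create requirements.txt in source_dir if needed and add the pin once."""
    req_file = Path(source_dir) / "requirements.txt"
    req_file.parent.mkdir(parents=True, exist_ok=True)
    # Keep any lines that are already in the file.
    lines = req_file.read_text().splitlines() if req_file.exists() else []
    # Append the pin only if an identical line is not already present.
    if requirement not in (line.strip() for line in lines):
        lines.append(requirement)
        req_file.write_text("\n".join(lines) + "\n")
    return req_file
```

For the setup in this thread, the call would presumably be `ensure_requirement("./examples/pytorch/summarization", "transformers==4.25.1")`, run before creating the estimator.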

Thanks a lot for the answer @philschmid :slightly_smiling_face:

The directory is located in the transformers repository from Hugging Face.
Should I open a pull request against the v4.25-release branch to add transformers==4.25.1?

No, this is not necessary, since the scripts assume the correct version. We are working on updating the container version so that no change is needed.

Cool! :slightly_smiling_face:

Is there an approximate date for the update to be completed?

Sometime in February. I'll let you know.

I’m unable to use transformers==4.25.1 with the following HuggingFaceProcessor:

hfp = HuggingFaceProcessor(
    role=get_execution_role(), 
    instance_count=1,
    instance_type='ml.g4dn.2xlarge',
    transformers_version='4.25.1',
    pytorch_version='1.10.2', 
    base_job_name='frameworkprocessor-hf',
    py_version = 'py38'
)

Here is the error:

ValueError: Unsupported huggingface version: 4.25.1. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface versions. Supported huggingface version(s): 4.4.2, 4.5.0, 4.6.1, 4.10.2, 4.11.0, 4.12.3, 4.17.0, 4.4, 4.5, 4.6, 4.10, 4.11, 4.12, 4.17.

Any idea when we’ll be able to use transformers==4.25.1 in this way?

Can you upgrade your sagemaker-sdk version? Transformers 4.26.0 is now available.

I’m having a similar problem with declare-lab/flan-alpaca-xl.

The model is sharded (pytorch_model-00001-of-00002.bin and pytorch_model-00002-of-00002.bin), and I am using transformers==4.31.0, yet I still get this error:

OSError: Can't load weights for 'declare-lab/flan-alpaca-xl'. Make sure that:

- 'declare-lab/flan-alpaca-xl' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'declare-lab/flan-alpaca-xl' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.

I can create a new post if it's not advisable to restart this old thread.
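
One thing that may be worth ruling out when loading from a local directory is an incomplete download. This is only a possible cause and is not confirmed anywhere in this thread. The sketch below assumes the usual sharded layout, where `pytorch_model.bin.index.json` holds a `weight_map` from parameter names to shard file names. The helper name `missing_shards` is hypothetical.

```python
import json
from pathlib import Path


def missing_shards(model_dir: str,
                   index_name: str = "pytorch_model.bin.index.json") -> list:
    """Return shard file names listed in the index but absent from model_dir."""
    index_path = Path(model_dir) / index_name
    if not index_path.exists():
        raise FileNotFoundError(f"No shard index found at {index_path}")
    # weight_map maps each parameter name to the shard file that stores it.
    weight_map = json.loads(index_path.read_text())["weight_map"]
    shard_files = sorted(set(weight_map.values()))
    return [name for name in shard_files
            if not (Path(model_dir) / name).exists()]
```

An empty list means every shard named in the index is present. It does not prove the files are intact, so checking the transformers version that is actually imported at runtime is still worthwhile.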

Hey @wilkinsjle,

have you tried using the LLM Container yet? See "Introducing the Hugging Face LLM Inference Container for Amazon SageMaker".

Nice, thanks for the suggestion. Sorry, I forgot to add that I was actually deploying locally without Amazon SageMaker, so my case is slightly different from the OP's, but I’ll give this a look.