I am trying to fine-tune a flan-t5-xl
model using run_summarization.py as the training script on Amazon SageMaker.
This is my main script:
from sagemaker.huggingface import HuggingFace

# pull the training script from the transformers repo at the matching release tag
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.17.0'}

# hyperparameters, which are passed into the training job
# (output_location, code_location and role are defined earlier in my code)
hyperparameters = {
    'per_device_train_batch_size': 8,
    'per_device_eval_batch_size': 8,
    'model_name_or_path': 'google/flan-t5-xl',
    'dataset_name': 'samsum',
    'do_train': True,
    'do_eval': True,
    'do_predict': True,
    'predict_with_generate': True,
    'output_dir': f'{output_location}/model',
    'num_train_epochs': 1,
    'learning_rate': 5e-5,
    'seed': 7,
    'fp16': True,
    'max_source_length': 1153,
    'max_target_length': 95,
    'source_prefix': 'summarize: ',
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',
    source_dir='./examples/pytorch/summarization',
    git_config=git_config,
    code_location=code_location,
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    transformers_version='4.17',
    pytorch_version='1.10',
    py_version='py38',
    role=role,
    hyperparameters=hyperparameters,
    output_path=output_location,
)

# start the training job
huggingface_estimator.fit()
However, I get this error:
OSError: google/flan-t5-xl does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
It seems that the checkpoint is split into two shards because of the 10 GB file size limit (pytorch_model-00001-of-00002.bin and pytorch_model-00002-of-00002.bin).
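For reference, the repo contents can be checked by listing the files on the Hub; a minimal sketch using huggingface_hub (assuming it is installed):

from huggingface_hub import list_repo_files

# list the files in the model repo to confirm the sharded checkpoint
print(list_repo_files("google/flan-t5-xl"))
# the output includes the two shard files and a pytorch_model.bin.index.json,
# but no single pytorch_model.bin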
How could I approach this problem?
I have thought about downloading and merging the model files into a single pytorch_model.bin file and then specifying the resulting model path in 'model_name_or_path'.
Would something like this work?
cat pytorch_model-00001-of-00002.bin pytorch_model-00002-of-00002.bin > pytorch_model.bin
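In case plain concatenation doesn't work (the shards may be independent PyTorch state-dict files rather than pieces of one binary), I was also considering loading the checkpoint with transformers and re-saving it. A rough sketch, assuming a recent transformers version that can load sharded checkpoints, that save_pretrained's max_shard_size controls the split threshold, and that the machine has enough RAM for the full fp32 model:

from transformers import AutoModelForSeq2SeqLM

# download and load the sharded checkpoint from the Hub
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# re-save with a shard-size limit above the model size so that a single
# pytorch_model.bin is written instead of two shards
model.save_pretrained("./flan-t5-xl-single", max_shard_size="25GB")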
Or perhaps I can download the pytorch_model.bin
file directly from somewhere?
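For example, something like this to pull all of the repo files locally; a sketch assuming huggingface_hub's snapshot_download (though this would still fetch the two shard files rather than a single pytorch_model.bin):

from huggingface_hub import snapshot_download

# download every file in the model repo (both shards, config, tokenizer, ...)
local_path = snapshot_download(repo_id="google/flan-t5-xl")
print(local_path)  # local cache directory containing the downloaded files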