SageMaker text summarization fine-tuning job failing

Hi Mighty HF community,

I am trying to build POC code to fine-tune the text summarization model sshleifer/distilbart-cnn-12-6 using SageMaker. The training job completes successfully, but I don’t see a model.tar.gz file at the destination location, nor any directory under /opt/ml. I’d appreciate any help you could provide. :slight_smile:

from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer_name = 'sshleifer/distilbart-cnn-12-6'
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
dataset_name = 'ccdv/cnn_dailymail'

I used 5,000 examples for training and 1,000 examples for testing from the ccdv/cnn_dailymail dataset, and tokenized two columns: article and highlights.


max_input_length = 512
max_target_length = 512

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["article"], max_length=max_input_length, truncation=True
    )
    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["highlights"], max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


raw_datasets = load_dataset(dataset_name, '3.0.0')
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

# Take 5,000 examples for training and 1,000 for evaluation from the train split
train_dataset1 = raw_datasets['train'].shuffle().select(range(5000))
test_dataset1 = raw_datasets['train'].shuffle().select(range(1000))

train_dataset1_tokenized = train_dataset1.map(preprocess_function, batched=True)
test_dataset1_tokenized = test_dataset1.map(preprocess_function, batched=True)

train_dataset1_tokenized = train_dataset1_tokenized.remove_columns(['article', 'highlights'])
test_dataset1_tokenized = test_dataset1_tokenized.remove_columns(['article', 'highlights'])

train_dataset1_tokenized.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset1_tokenized.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
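
As a quick sanity check (not part of my original notebook, just illustrative), one processed example can be inspected to confirm the expected tensors are there:

# The processed datasets should expose input_ids, attention_mask and labels as tensors.
print(train_dataset1_tokenized.column_names)
print(train_dataset1_tokenized[0]['input_ids'][:10])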

Then I uploaded the train and test datasets to an S3 bucket.

import sagemaker
from datasets.filesystems import S3FileSystem  # old datasets API, used with save_to_disk(..., fs=s3)

sess = sagemaker.Session()
s3 = S3FileSystem()

s3_prefix = f'samples/datasets/{dataset_name}'

training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset1_tokenized.save_to_disk(training_input_path, fs=s3)

test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset1_tokenized.save_to_disk(test_input_path, fs=s3)

print(f'Uploaded training data to {training_input_path}')
print(f'Uploaded testing data to {test_input_path}')

Hyperparameter and estimator definition


hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 'model_name': model_name,  # model_name is defined earlier in the notebook
                 'tokenizer_name': tokenizer_name,
                 }

from sagemaker.huggingface import HuggingFace

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.17.0'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
	entry_point='run_summarization.py',
	source_dir='./examples/pytorch/summarization',
	instance_type='ml.p3.2xlarge',
	instance_count=2,
	role=role,
	git_config=git_config,
	transformers_version='4.17.0',
	pytorch_version='1.10.2',
	py_version='py38',
	hyperparameters = hyperparameters
)

Model fit


huggingface_estimator.fit(
            {'train': training_input_path, 'test': test_input_path}, 
            wait=False, 
            job_name='finetune-sshleifer-distilbart-cnn-12-6-2022-06-03-22-16-10' )

Hello @dineshmane,

All files that are saved to /opt/ml/model will be uploaded to Amazon S3 as model.tar.gz. Looking at your hyperparameters, you don’t set the output_dir to /opt/ml/model. That means Transformers is saving the files somewhere, but not in the directory that gets uploaded to S3.
Add "output_dir": "/opt/ml/model" to your hyperparameters and it should work.

Thank you @philschmid for your response. I updated the hyperparameters to:

hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 'model_name': model_name,
                 'tokenizer_name': tokenizer_name,
                 'output_dir':'/opt/ml/model',
                 }

The training job completes successfully, but I’m getting a ValueError in CloudWatch: “ValueError: Need either a dataset name or a training/validation file.”

Also, the /opt/ml/model directory wasn’t created.

Could you check this blog post: Distributed Training: Train BART/T5 for Summarization using 🤗 Transformers and Amazon SageMaker?
It does the same thing. There may be a few more minor issues in your configuration, e.g. the hyperparameter names model_name and tokenizer_name are wrong, and likewise train_batch_size doesn’t exist in examples/; you have to use per_device_train_batch_size and model_name_or_path instead.
Also, if you want to use a dataset from S3 with the example, you need to pass train_file and validation_file pointing to the location where SageMaker stores the data. You can find all requirements for the summarization example here: transformers/examples/pytorch/summarization at main · huggingface/transformers · GitHub
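
To make that concrete (a sketch under my own assumptions, not code from the thread; file names and prefixes are illustrative): run_summarization.py tokenizes the text itself, so one way to prepare the S3 input is to export the raw, untokenized subsets to CSV and upload those:

# Export the raw article/highlights subsets to CSV; run_summarization.py expects
# plain text columns and does its own tokenization.
train_dataset1.to_csv("train.csv")
test_dataset1.to_csv("test.csv")

import sagemaker

sess = sagemaker.Session()
# upload_data() returns the S3 URI of the uploaded object
training_input_path = sess.upload_data("train.csv", key_prefix="samples/summarization")
test_input_path = sess.upload_data("test.csv", key_prefix="samples/summarization")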

Thank you @philschmid for your recommendation.
I updated the hyperparameters and added the train & test paths as below:

hyperparameters={'epochs': 1,
                 'per_device_train_batch_size': 32,
                 'per_device_eval_batch_size': 4,
                 'train_file': "/opt/ml/input/data/train/train.csv",
                 'test_file': "/opt/ml/input/data/test/test.csv",
                 'model_name_or_path': model_name,
                 'tokenizer_name': tokenizer_name,
                 'output_dir': '/opt/ml/model',
                 }

tokenizer_name is a parameter in run_summarization.py.

The training job is still failing with the following error.

Failure reason


AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") ValueError: Some specified arguments are not used by the HfArgumentParser: ['--epochs', '1']" Command "/opt/conda/bin/python3.8 run_summarization.py --epochs 1 --model_name_or_path sshleifer-distilbart-cnn-12-6 --output_dir /opt/ml/model --per_device_eval_batch_size 4 --per_device_train_batch_size 32 --test_file /opt/ml/input/data/test/test.csv --tokenizer_name sshleifer/distilbart-cnn-12-6 --train_file /opt/ml/input/data/train/train.csv", exit code: 1

In CloudWatch I get this error:
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")

Did I miss anything in the hyperparameters?

You can find the Seq2SeqTrainingArguments here: Trainer
They can be passed as hyperparameters. For example, you are passing epochs, which doesn’t exist; it should be num_train_epochs.
Also make sure to pass do_train: True and do_eval: True, otherwise the script won’t do much.

A good tip for next time: you can simply run the script locally to check that it works with python3 run_summarization.py --args value, and once you have a working version, convert the --args value pairs into a Python dict.
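
For illustration (my own sketch of that workflow, with placeholder values): once a local command like the one in the comment below works, each --flag value pair becomes a key/value entry in the hyperparameters dict, which the HuggingFace estimator turns back into command-line arguments:

# Verified locally first, e.g.:
#   python3 run_summarization.py --model_name_or_path sshleifer/distilbart-cnn-12-6 \
#       --train_file train.csv --do_train True --num_train_epochs 1 \
#       --per_device_train_batch_size 4 --output_dir ./out
#
# The same flags expressed as SageMaker hyperparameters (the estimator converts each
# key/value pair back into "--key value"; paths now point inside the container):
hyperparameters = {
    'model_name_or_path': 'sshleifer/distilbart-cnn-12-6',
    'train_file': '/opt/ml/input/data/train/train.csv',
    'do_train': True,
    'num_train_epochs': 1,
    'per_device_train_batch_size': 4,
    'output_dir': '/opt/ml/model',
}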

P.S.

You don’t need to pass tokenizer_name if it is the same as the model: "Pretrained tokenizer name or path if not the same as model_name".


Many thanks @philschmid for your explanation.
The suggestion to run the script locally really helped me debug quickly.

For other users: after satisfying the hyperparameter requirements I ran into another error:
RuntimeError: CUDA out of memory. Tried to allocate 548.00 MiB (GPU 0; 11.17 GiB total capacity; 10.11 GiB already allocated; 523.88 MiB free; 10.23 GiB reserved in total by PyTorch)

I changed the per_device_train_batch_size from 32 to 4 and it worked.
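
A related option I didn’t end up using (just an aside, not from the thread): the Trainer also supports gradient accumulation, which keeps a larger effective batch size without the extra memory:

# Accumulate gradients over 8 steps: effective batch size of 4 * 8 = 32 per device.
hyperparameters.update({
    'per_device_train_batch_size': 4,
    'gradient_accumulation_steps': 8,
})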

I can see the output.tar.gz file generated at the S3 location. Now it’s deployment and inference time :melting_face:

Final code for the rest of the folks like me:


hyperparameters={'num_train_epochs': 1,
                 'do_train': True,
                 'do_eval': False,
                 'per_device_train_batch_size': 4,
                 'per_device_eval_batch_size': 4,
                 'train_file': "/opt/ml/input/data/train/train.csv",
                 'test_file': "/opt/ml/input/data/test/test.csv",
                 'model_name_or_path': model_name,
                 'tokenizer_name': tokenizer_name,
                 'output_dir': '/opt/ml/model',
                 }

# I saved the data in CSV format
training_input_path = "s3://sagemaker-eu-west-1-dinesh/samples/dinesh/train.csv"
test_input_path = "s3://sagemaker-eu-west-1-dinesh/samples/dinesh/test.csv"

data = {
    'train': training_input_path,
    'test': test_input_path
}

from sagemaker.huggingface import HuggingFace

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.17.0'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
                            entry_point='run_summarization.py',
                            source_dir='./examples/pytorch/summarization',
                            #instance_type='ml.p3.2xlarge',
                            #instance_type='ml.p2.xlarge',
                            instance_type='ml.p2.8xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.17', 
                            pytorch_version='1.10',
                            py_version='py38',
                            git_config=git_config,
                            hyperparameters = hyperparameters,
#                             metric_definitions=metric_definitions,
                            max_run=36000, # expected max run in seconds
                        )

huggingface_estimator.fit(data,
                          wait=False, 
                          job_name=training_job_name )
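
For completeness (a sketch of the deployment step, not code from the thread; the instance type and input text are placeholders): once the training job has finished, the estimator can be deployed to a real-time endpoint and queried:

# Deploy the fine-tuned model to a SageMaker real-time inference endpoint.
predictor = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',
)

# Send a summarization request to the endpoint.
result = predictor.predict({'inputs': "Some long news article text to summarize ..."})
print(result)

# Delete the endpoint when done to avoid ongoing charges.
predictor.delete_endpoint()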