Data not passed to run_summarization.py

Hello!

I’m trying to fine-tune the NbAiLab/nb-t5-base model on my own dataset. The training job starts fine when calling huggingface_estimator.fit(), but raises a ValueError after the training image has been downloaded:

ValueError: Need either a dataset name or a training/validation file.

So it appears that the training data passed to huggingface_estimator.fit() is not forwarded correctly to the run_summarization.py script. Any pointers? A simplified version of my script is below.

CHECKPOINT = "NbAiLab/nb-t5-base"

train_dataset = Dataset.from_pandas(train_df)  # df with columns 'text' and 'labels'
test_dataset = Dataset.from_pandas(test_df)

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)
   
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

s3 = S3FileSystem()  

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/data/train'
train_dataset.save_to_disk(training_input_path, fs=s3)

# save preprocessed test data to S3
test_input_path = f"s3://{session.default_bucket()}/data/test"
test_dataset.save_to_disk(test_input_path, fs=s3)

hyperparameters = {
    "model_name_or_path": CHECKPOINT,
    "output_dir": "/opt/ml/model",
    "epochs": 1,
    "train_batch_size": 32
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.6.1'}

# creates Hugging Face estimator
huggingface_estimator = sagemaker.huggingface.HuggingFace(
    entry_point='run_summarization.py',
    source_dir='./examples/pytorch/summarization',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints',
    use_spot_instances=True,
    max_wait=3600,
    max_run=1000,
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters
)

How are you calling the fit() method?

So, it looks like a typo was causing the problems. I had used both sess.default_bucket() and session.default_bucket(), while sess is never defined. My bad!

For reference, I called the fit method as follows:

#starting the train job
huggingface_estimator.fit(
    inputs={
        "train": training_input_path,
        "test": test_input_path
    }
)

Unfortunately, the problem persists even after fixing the typos in the default bucket calls.

Calling the .fit() method as follows still results in the same error. I have verified that training_input_path and test_input_path point to the correct locations on S3 (the locations passed to the .save_to_disk() calls on the two Dataset objects).

huggingface_estimator.fit(
    {
        "train": training_input_path,
        "test": test_input_path
    }
)

This still results in the following ValueError; any help is appreciated!

ValueError: Need either a dataset name or a training/validation file.

Looking at the source code for run_summarization.py, the error is raised in the __post_init__ method of DataTrainingArguments here:

if self.dataset_name is None and self.train_file is None and self.validation_file is None:
    raise ValueError("Need either a dataset name or a training/validation file.")

So it looks like the data from S3 is not passed on as train_file. Looking at the train_file argument, it expects a CSV or JSON Lines file rather than a location on S3:

train_file: Optional[str] = field(
    default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
)

Am I right in concluding from this that the run_summarization.py script cannot be used with data from S3?

If you are using run_summarization.py you need to tell it where to look. SageMaker will copy your S3 data into /opt/ml/input/data/train/whatever.csv and /opt/ml/input/data/test/whatever.csv, so pass the file locations as additional hyperparameters, e.g.:

hyperparameters = {
    'model_name_or_path': MODEL,
    'train_file': "/opt/ml/input/data/train/whatever_train.csv",
    'test_file': "/opt/ml/input/data/test/whatever_test.csv",
    # etc.
}

Thanks for the tip! I was using the save_to_disk() method from the Dataset object, which (I believe) writes the entire Dataset to S3 in Arrow format. I’ll try it out with a simple CSV file instead.
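
For reference, here is a minimal sketch of what I have in mind for that CSV route, assuming the train_df/test_df frames and CHECKPOINT from my first post; the file names and the extra text_column/summary_column hint are my own assumptions, not something from the example script's docs.

import sagemaker

session = sagemaker.Session()

# Write the raw (untokenized) frames out as CSV; run_summarization.py
# does its own tokenization, so only the text/label columns are needed.
train_df[["text", "labels"]].to_csv("train.csv", index=False)
test_df[["text", "labels"]].to_csv("test.csv", index=False)

# Upload the CSVs; SageMaker stages each channel under /opt/ml/input/data/<channel>/.
training_input_path = session.upload_data("train.csv", key_prefix="data/train")
test_input_path = session.upload_data("test.csv", key_prefix="data/test")

hyperparameters = {
    "model_name_or_path": CHECKPOINT,
    "output_dir": "/opt/ml/model",
    "train_file": "/opt/ml/input/data/train/train.csv",
    "test_file": "/opt/ml/input/data/test/test.csv",
    # the script appears to also accept text_column/summary_column
    # if the column names need to be spelled out explicitly
}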


@thusken if you would like to use your dataset, which was saved with save_to_disk, you would need to fork the examples/ script you use and replace the load_dataset call with the load_from_disk method. [REF]
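
Roughly, the change inside the forked run_summarization.py could look like the sketch below; the channel paths and the variable name are assumptions and may differ depending on the script version you fork.

from datasets import DatasetDict, load_from_disk

# In the forked script, replace the load_dataset(...) call with load_from_disk,
# pointing at the SageMaker channels that fit() populated from S3.
raw_datasets = DatasetDict({
    "train": load_from_disk("/opt/ml/input/data/train"),
    "test": load_from_disk("/opt/ml/input/data/test"),
})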


Good to know, @philschmid! I’ll try the CSV approach first, but might give the load_from_disk method a go if that doesn’t work.
