Hello!

I’m in the process of trying to fine-tune the NbAiLab/nb-t5-base model on my own dataset. The training job starts fine when calling huggingface_estimator.fit(), but gives a ValueError after the training image has been downloaded:

ValueError: Need either a dataset name or a training/validation file.

So it appears that the training data that is passed in huggingface_estimator.fit() is not passed correctly to the run_summarization.py script. Any pointers? A simplified script is below.
import sagemaker
import sagemaker.huggingface
from datasets import Dataset
from datasets.filesystems import S3FileSystem
from transformers import AutoTokenizer

CHECKPOINT = "NbAiLab/nb-t5-base"

train_dataset = Dataset.from_pandas(train_df)  # df with columns 'text' and 'labels'
test_dataset = Dataset.from_pandas(test_df)

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

s3 = S3FileSystem()

# save preprocessed train data to S3
training_input_path = f's3://{sess.default_bucket()}/data/train'
train_dataset.save_to_disk(training_input_path, fs=s3)

# save preprocessed test data to S3
test_input_path = f"s3://{session.default_bucket()}/data/test"
test_dataset.save_to_disk(test_input_path, fs=s3)

hyperparameters = {
    "model_name_or_path": CHECKPOINT,
    "output_dir": "/opt/ml/model",
    "epochs": 1,
    "train_batch_size": 32
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

# creates Hugging Face estimator
huggingface_estimator = sagemaker.huggingface.HuggingFace(
    entry_point='run_summarization.py',
    source_dir='./examples/pytorch/summarization',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints',
    use_spot_instances=True,
    max_wait=3600,
    max_run=1000,
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters
)
How are you calling the fit() method?
So, it looks like a typo was causing the problems. I used both sess.default_bucket() and session.default_bucket(), while sess is never defined. My bad!
For reference, I called the fit method as follows:

# starting the train job
huggingface_estimator.fit(
    inputs={
        "train": training_input_path,
        "test": test_input_path
    }
)
Unfortunately, the problem persists even after fixing the typos in the default buckets. Calling the .fit() method as follows still results in the same error. I have verified that training_input_path and test_input_path point to the correct locations on S3 (the locations passed to the .save_to_disk() calls on the two Dataset objects).
huggingface_estimator.fit(
    {
        "train": training_input_path,
        "test": test_input_path
    }
)
This still results in the following ValueError; any help is still appreciated!

ValueError: Need either a dataset name or a training/validation file.
Looking at the source code for run_summarization.py, the error seems to be in the __post_init__ call of DataTrainingArguments here:

if self.dataset_name is None and self.train_file is None and self.validation_file is None:
    raise ValueError("Need either a dataset name or a training/validation file.")
So it looks like the data from S3 is not passed to train_file. Looking at the train_file argument, it seems to expect a csv or json file rather than a location on S3:

train_file: Optional[str] = field(
    default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
)
Am I right in concluding from this that the run_summarization script cannot be used with data from S3?
If you are using run_summarization.py you need to tell it where to look. SageMaker will load your S3 data into /opt/ml/input/data/train/whatever.csv or /opt/ml/input/data/test/whatever.csv, so pass additional parameters in the hyperparameters, i.e.

hyperparameters = {
    'model_name_or_path': MODEL,
    'train_file': "/opt/ml/input/data/train/whatever_train.csv",
    'test_file': "/opt/ml/input/data/test/whatever_test.csv",
    # etc.
}
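For example, something along these lines should work once the files are in those channels. This is only a sketch: the file names, channel names, and bucket prefixes are placeholders to adapt to what you actually upload, and note that the __post_init__ check quoted above only looks at train_file/validation_file, so at least train_file needs to be set:

# Sketch: point run_summarization.py at the local copies of the S3 channels.
# Each channel "<name>" passed to fit() ends up under /opt/ml/input/data/<name>/
hyperparameters = {
    "model_name_or_path": CHECKPOINT,
    "output_dir": "/opt/ml/model",
    "train_file": "/opt/ml/input/data/train/train.csv",
    "validation_file": "/opt/ml/input/data/validation/validation.csv",
}

# pass these hyperparameters when constructing the HuggingFace estimator, then:
huggingface_estimator.fit(
    {
        "train": f"s3://{session.default_bucket()}/data/train",            # contains train.csv
        "validation": f"s3://{session.default_bucket()}/data/validation",  # contains validation.csv
    }
)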
Thanks for the tip! I was using the save_to_disk() method from the Dataset object, which (I believe) writes the entire Dataset object to S3 as an .arrow file. I’ll try it out with a simple csv file.
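For reference, this is roughly what I have in mind for the CSV route. As far as I can tell run_summarization.py does its own tokenization, so I’d upload the raw text/label columns rather than the tokenized Dataset (untested sketch; session is the sagemaker.Session from my notebook and the prefixes mirror my script above):

# Untested sketch: write the raw splits as CSV and upload them to the same
# S3 prefixes that are passed to fit() as the "train" and "test" channels.
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)

session.upload_data("train.csv", key_prefix="data/train")
session.upload_data("test.csv", key_prefix="data/test")

training_input_path = f"s3://{session.default_bucket()}/data/train"
test_input_path = f"s3://{session.default_bucket()}/data/test"

# ...and then, per the reply above, point the script at the local copies,
# e.g. hyperparameters["train_file"] = "/opt/ml/input/data/train/train.csv"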
@thusken if you would like to use your dataset, which was saved with save_to_disk, you would need to fork the examples/ script you use and replace the load_dataset method with the load_from_disk method. [REF]
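Roughly, the data loading in the forked script would then become something like this (a sketch; the local paths assume the "train" and "test" channels from your fit() call):

from datasets import load_from_disk

# Load the Arrow datasets produced by save_to_disk() from the local
# directories where SageMaker copies the S3 channels.
train_dataset = load_from_disk("/opt/ml/input/data/train")
test_dataset = load_from_disk("/opt/ml/input/data/test")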
Good to know @philschmid! I’ll try it out with the CSV approach first, but might give it a go with the load_from_disk method if that doesn’t work.