Data not passed to


I’m trying to fine-tune the NbAiLab/nb-t5-base model on my own dataset. The training job starts fine, but it raises a ValueError after the training image has been downloaded:

ValueError: Need either a dataset name or a training/validation file.

So it appears that the training data that is passed in does not reach the script correctly. Any pointers? A simplified script is below.

CHECKPOINT = "NbAiLab/nb-t5-base"

train_dataset = Dataset.from_pandas(train_df)  # df with columns 'text' and 'labels'
test_dataset = Dataset.from_pandas(test_df)

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)
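
# apply the tokenizer to every example in batches
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)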

train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

s3 = S3FileSystem()  

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/data/train'

# save preprocessed test data to S3
test_input_path = f"s3://{session.default_bucket()}/data/test"
test_dataset.save_to_disk(test_input_path, fs=s3)

hyperparameters = {
    "epochs": 1,
    "train_batch_size": 32,
}

# git configuration to download our fine-tuning script
git_config = {'repo': '','branch': 'v4.6.1'}

# creates Hugging Face estimator
huggingface_estimator = sagemaker.huggingface.HuggingFace(
	hyperparameters = hyperparameters

How are you calling the fit() method?

So, it looks like a typo was causing the problems. I used both sess.default_bucket() and session.default_bucket(), while sess is never defined. My bad!

For reference, I called the fit method as follows:

# starting the train job
huggingface_estimator.fit({
    "train": training_input_path,
    "test": test_input_path
})

Unfortunately, the problem persists even after fixing the typos in the default buckets.

Calling the .fit() method as follows still results in the same error. I have verified that training_input_path and test_input_path point to the correct locations on S3 (the locations passed to the .save_to_disk() calls on the two Dataset objects).

huggingface_estimator.fit({
    "train": training_input_path,
    "test": test_input_path
})

This still results in the following ValueError. Any help is appreciated!

ValueError: Need either a dataset name or a training/validation file.

Looking at the source code of the run_summarization script, the error seems to be raised in the __post_init__ method of DataTrainingArguments:

if self.dataset_name is None and self.train_file is None and self.validation_file is None:
    raise ValueError("Need either a dataset name or a training/validation file.")

So it looks like the data from S3 is not passed to train_file. Looking at the train_file argument, it seems to expect a CSV or JSON file rather than a location on S3:

train_file: Optional[str] = field(
    default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
)
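
To make the failure mode concrete, here is a minimal sketch that mirrors only the check quoted above. DataArgsSketch and try_build are hypothetical stand-ins, not the script's actual code, and every other field and check is left out. The point is that the check only looks at three arguments, and passing channels to fit() does not set any of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataArgsSketch:
    """Hypothetical stand-in that reproduces only the quoted check."""

    dataset_name: Optional[str] = None
    train_file: Optional[str] = None
    validation_file: Optional[str] = None

    def __post_init__(self):
        # Only these three arguments are inspected; if none is given,
        # validation fails before any data is read.
        if self.dataset_name is None and self.train_file is None and self.validation_file is None:
            raise ValueError("Need either a dataset name or a training/validation file.")


def try_build(**kwargs):
    """Return the constructed args, or the error message if validation fails."""
    try:
        return DataArgsSketch(**kwargs)
    except ValueError as err:
        return str(err)
```

Calling try_build() with no arguments returns the error message from this thread, while passing train_file with a path inside the container gets past the check.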

Am I right in concluding from this that the run_summarization script cannot be used with data from S3?

If you are using that script, you need to tell it where to look. SageMaker will load your S3 data into /opt/ml/input/data/train/whatever.csv or /opt/ml/input/data/test/whatever.csv, so pass the file paths as additional hyperparameters, i.e.:

hyperparameters = {
    'model_name_or_path': MODEL,
    'train_file': "/opt/ml/input/data/train/whatever_train.csv",
    'test_file': "/opt/ml/input/data/test/whatever_test.csv",
}
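
As a rough sketch of the same idea, the paths can be built from the channel names instead of being typed out by hand. This assumes the channel names match the keys passed to fit() ("train" and "test") and that the CSV files sit directly under those S3 prefixes. The helpers channel_file and build_hyperparameters are hypothetical names used only for illustration.

```python
import posixpath

# Directory under which SageMaker mounts each input channel in the container.
CHANNEL_ROOT = "/opt/ml/input/data"


def channel_file(channel: str, filename: str) -> str:
    """Build the in-container path of a file uploaded under a given channel."""
    return posixpath.join(CHANNEL_ROOT, channel, filename)


def build_hyperparameters(model: str, train_name: str, test_name: str) -> dict:
    """Assemble hyperparameters that point the script at the channel files."""
    return {
        "model_name_or_path": model,
        "train_file": channel_file("train", train_name),
        "test_file": channel_file("test", test_name),
    }


# File names are the placeholders used earlier in this thread.
hyperparameters = build_hyperparameters(
    "NbAiLab/nb-t5-base", "whatever_train.csv", "whatever_test.csv"
)
```

The resulting dictionary can then be passed to the estimator in the usual way.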

Thanks for the tip! I was using the save_to_disk() method from the Dataset object which (I believe) writes the entire Dataset object to S3 as an .arrow file. I’ll try it out with a simple csv file.
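
For reference, a minimal sketch of what the CSV route could look like, using only the standard library. The rows below are made-up placeholders with the 'text' and 'labels' columns mentioned earlier, and write_csv is a hypothetical helper. With a pandas DataFrame, train_df.to_csv(path, index=False) should produce an equivalent file. Uploading the file to S3 is not shown here.

```python
import csv
import os
import tempfile


def write_csv(rows, path, columns=("text", "labels")):
    """Write an iterable of dict rows to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(columns))
        writer.writeheader()
        writer.writerows(rows)
    return path


# Made-up placeholder rows, only to show the file layout.
example_rows = [
    {"text": "first document", "labels": "first label"},
    {"text": "second document", "labels": "second label"},
]
csv_path = write_csv(
    example_rows, os.path.join(tempfile.mkdtemp(), "whatever_train.csv")
)
```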


@thusken, if you would like to use your dataset, which was saved with save_to_disk, you would need to fork the examples/ script you use and replace the load_dataset method with the load_from_disk method.
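
A rough sketch of what that replacement could look like inside a forked script. It assumes the S3 prefix given to save_to_disk is the same one passed to fit() for that channel, so the saved dataset lands at the root of the channel directory. channel_dir and load_saved_split are hypothetical helper names; load_from_disk is the function from the datasets library.

```python
import os


def channel_dir(channel: str) -> str:
    """Return the directory for an input channel inside the training container.

    SageMaker's training toolkit exposes each channel as SM_CHANNEL_<NAME>;
    the fallback is the conventional mount point.
    """
    default = f"/opt/ml/input/data/{channel}"
    return os.environ.get(f"SM_CHANNEL_{channel.upper()}", default)


def load_saved_split(channel: str):
    """Load a dataset written with save_to_disk from a channel directory."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_from_disk

    return load_from_disk(channel_dir(channel))
```

In the forked script, load_saved_split("train") would then stand in for the load_dataset call on the training data.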


Good to know, @philschmid! I’ll try the CSV approach first, but might give the load_from_disk method a go if that doesn’t work.