Data not passed to run_summarization.py

thusken · March 8, 2022, 1:57pm

Looking at the source code for run_summarization.py, the error seems to be raised in the __post_init__ method of DataTrainingArguments here:

if self.dataset_name is None and self.train_file is None and self.validation_file is None:
    raise ValueError("Need either a dataset name or a training/validation file.")

So it looks like the data from S3 is not passed to train_file. Looking at the train_file argument, it seems to expect a csv or json file rather than a location on S3:

train_file: Optional[str] = field(
    default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
)
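To check my reading of that condition, here is a minimal sketch that reproduces only the check quoted above. The class is a cut-down stand-in for the real DataTrainingArguments (it keeps just the three fields the condition looks at), `check` is a hypothetical helper of mine, and `train.json` is a made-up file name, so treat this as an illustration rather than the script's actual behaviour end to end.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataTrainingArguments:
    # Simplified stand-in: only the three fields used by the quoted check.
    dataset_name: Optional[str] = field(default=None)
    train_file: Optional[str] = field(default=None)
    validation_file: Optional[str] = field(default=None)

    def __post_init__(self):
        # Same condition as in the snippet quoted from run_summarization.py.
        if self.dataset_name is None and self.train_file is None and self.validation_file is None:
            raise ValueError("Need either a dataset name or a training/validation file.")


def check(**kwargs) -> str:
    """Hypothetical helper: build the arguments and report what happened."""
    try:
        DataTrainingArguments(**kwargs)
        return "ok"
    except ValueError as err:
        return f"ValueError: {err}"


if __name__ == "__main__":
    print(check())                         # none of the three values is set
    print(check(train_file="train.json"))  # any non-None value passes this check
```

If this sketch is faithful, the ValueError only fires when all three values are None, which would mean that no dataset_name, train_file or validation_file reached the script at all. The check itself does not look at what kind of string train_file holds.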

Am I right in concluding from this that the run_summarization.py script cannot be used with data from S3?
