I used the following code to upload a local dataset to a private S3 bucket. The dataset is very large, and the upload took about 16 hours to run.
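from glob import glob
import aiobotocore.session
import s3fs
from datasets import load_dataset_builder

# Authenticate with the "default" AWS profile and reuse the session for all S3 access
s3_session = aiobotocore.session.AioSession(profile='default')
storage_options = {"session": s3_session}
fs = s3fs.S3FileSystem(**storage_options)
data_files = {"train": sorted(glob("path/to/parquets/*.parquet"))}
output_dir = "s3://output/dir/here"
builder = load_dataset_builder("parquet", data_files=data_files)
# Prepare the dataset and write it to the bucket as Parquet files
builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")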
Unfortunately, when I attempt to load the dataset from the S3 bucket using load_from_disk(), I get the following error:
FileNotFoundError: Directory s3://output/dir/here is neither a Dataset directory nor a DatasetDict directory.
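As far as I can tell, load_from_disk() decides what a directory is by looking for a few marker files. Below is a minimal sketch of that check as I understand it. The marker file names (dataset_info.json, state.json, dataset_dict.json) are an assumption based on my reading of the datasets source, and classify_directory is a hypothetical helper, not part of the library.

```python
# Assumed marker files (my reading of the datasets source, not verified against every version)
DATASET_MARKERS = {"dataset_info.json", "state.json"}
DATASET_DICT_MARKER = "dataset_dict.json"


def classify_directory(file_names):
    """Guess whether a listing looks like a Dataset or DatasetDict directory.

    file_names: base names of the entries directly under the directory.
    """
    names = set(file_names)
    if DATASET_MARKERS <= names:
        return "Dataset"
    if DATASET_DICT_MARKER in names:
        return "DatasetDict"
    return "neither"


# Hypothetical listing, for illustration only
print(classify_directory(["dataset_info.json", "part-0.parquet"]))

# Against the bucket, the listing could come from the s3fs filesystem above:
# names = [p.rsplit("/", 1)[-1] for p in fs.ls("s3://output/dir/here", detail=False)]
# print(classify_directory(names))
```

If this returns "neither" for the output directory, that would match the error message above.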
Not sure where I went wrong. Any help would be greatly appreciated.