Host and share datasets: S3

Pannis · July 14, 2022, 7:05am

Basically, I have a text file that I want to store on S3 and then load with the Python “datasets” library so I can share it for collaboration. I followed the instructions on the Cloud storage documentation page.

I have uploaded the text on S3.
I am trying to use datasets to list the data on S3, but I am getting access errors.
I want to use the load_from_disk() function later to load the data from S3.

import datasets
s3 = datasets.filesystems.S3FileSystem(anon=True)
s3.ls('text-data/processed/')

I checked with aws s3 ls s3://text-data/processed/ and I am able to list the objects, so it doesn’t seem to be an access issue.

A good example would make it easier for a beginner to get started.

Please let me know if I am missing something. Thanks.

lhoestq · July 22, 2022, 11:55am

Hi ! load_from_disk can be used to load an Arrow dataset exported with save_to_disk.

To load a text file from your S3, you can use load_dataset and pass the URLs of your text files as data_files, for example:

from datasets import load_dataset

url = "https://s3.amazonaws.com/conceptnet/downloads/2018/omcs-sentences-free.txt"
ds = load_dataset("text", data_files={"train": [url]})
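
If you have several files or splits, the same pattern can be wrapped in small helpers. This is only a sketch: the names public_s3_url, build_data_files and load_text_splits are hypothetical, and the URL layout simply mirrors the path-style form used in the example above, which assumes the objects are publicly readable. Other buckets may need a different endpoint.

```python
from typing import Dict, List
from urllib.parse import quote


def public_s3_url(bucket: str, key: str) -> str:
    """Build a path-style HTTPS URL for a publicly readable S3 object."""
    # quote() keeps "/" intact and escapes characters such as spaces in the key.
    return f"https://s3.amazonaws.com/{bucket}/{quote(key.lstrip('/'))}"


def build_data_files(bucket: str, keys_by_split: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each split name to the list of URLs for its object keys."""
    return {
        split: [public_s3_url(bucket, key) for key in keys]
        for split, keys in keys_by_split.items()
    }


def load_text_splits(bucket: str, keys_by_split: Dict[str, List[str]]):
    """Load the given text files with the "text" builder, one row per line."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("text", data_files=build_data_files(bucket, keys_by_split))
```

For the file above, load_text_splits("conceptnet", {"train": ["downloads/2018/omcs-sentences-free.txt"]}) builds the same URL. Plain HTTPS URLs like this carry no credentials, so they will not reach a private bucket. On a related note, anon=True in the question's snippet asks for an anonymous connection, which may explain why listing fails there while the aws CLI, which uses your configured credentials, succeeds.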
