Host and share datasets: S3

Basically, I have a text file that I want to store on S3 and then load and share for collaboration using the Python “datasets” library. I followed the instructions in the Cloud storage section of the documentation.

I have uploaded the text file to S3.
I am trying to use datasets to list the data on S3, but I am getting access errors.
I want to use the load_from_disk() function later to load the data from S3.

import datasets
s3 = datasets.filesystems.S3FileSystem(anon=True)
s3.ls('text-data/processed/')

I checked with aws s3 ls s3://text-data/processed/ and I am able to list the objects, so it doesn’t seem to be an access issue.

A good example would make it easier for a beginner to get started.

Please let me know if I am missing something. Thanks.

Hi! load_from_disk is meant for loading an Arrow dataset that was exported with save_to_disk.

To load a text file from your S3 bucket, you can use load_dataset and pass the URLs of your text files as data_files, for example:

from datasets import load_dataset

url = "https://s3.amazonaws.com/conceptnet/downloads/2018/omcs-sentences-free.txt"
ds = load_dataset("text", data_files={"train": [url]})