Maybe a related feature request/question:
Could load_dataset accept _io.TextIOWrapper objects (open file handles) in addition to file paths?
Currently, a file path is supported:
from datasets import load_dataset

infile = "../input/tatoeba/tatoeba-sentpairs.tsv"
train_data = load_dataset("csv", data_files=infile,
                          streaming=True, delimiter="\t", split="train")
but sometimes it would be very useful to be able to do something like:
infile = open("../input/tatoeba/tatoeba-sentpairs.tsv") # -> _io.TextIOWrapper
train_data = load_dataset("csv", data_files=infile,
                          streaming=True, delimiter="\t", split="train")
If the latter were supported, then maybe we could pipe/stream directly from S3 like this:
from datasets.filesystem import S3FileSystem
s3 = S3FileSystem()
# Text mode ("r") is needed to get an _io.TextIOWrapper; the default mode is binary.
infile = s3.open("../input/tatoeba/tatoeba-sentpairs.tsv", "r")  # -> _io.TextIOWrapper
train_data = load_dataset("csv", data_files=infile,
                          streaming=True, delimiter="\t", split="train")
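In the meantime, a possible workaround might be to read the open handle yourself and yield rows lazily. This is only a minimal sketch using the standard library, not something load_dataset does; the helper name rows_from_handle and the sample column names are made up for illustration, and an io.StringIO stands in for the real file handle.

```python
import csv
import io


def rows_from_handle(handle, delimiter="\t"):
    """Lazily yield one dict per row from an already-open text file object.

    Hypothetical helper: the first line of the stream is treated as the header.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        yield row


# Stand-in for open(...) or a text-mode handle from a remote filesystem.
sample = io.StringIO("src\ttgt\nhello\thallo\nthanks\tdanke\n")
rows = list(rows_from_handle(sample))
```

A generator like this could then probably be handed to a generator-based constructor such as Dataset.from_generator, if the installed datasets version provides one. I have not tested that pairing, and an open handle may not survive any pickling the library does internally, so reopening the file inside the generator might be safer.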