Splitting dataset from generator

Hello all!

I am building a dataset with a Python generator that fetches data from AWS S3 buckets, using boto3 to retrieve the images and annotations I need. Each yielded dict looks something like this:

            {
                'bboxes': [],
                'image': Image.open(get_file(s3, 'bucket_name', img_file)).resize((1000, 1000)),
                'ner_tags': [],
                'tokens': [],
                'ids': []
            }

Since my AWS bucket holds around 14,000 documents, I thought a generator would be my best bet for loading this data with HF datasets.

I am already aware of the from_generator() method on the Dataset class, and I want to load the dataset like this:

dataset = Dataset.from_generator(s3_file_generator)

However, I also want to split the dataset into train and test datasets (80:20 ratio), and, unlike the load_dataset() function, the from_generator() method does not accept the split="train" parameter. I have tried this already:

dataset = Dataset.from_generator(s3_file_generator).train_test_split(test_size=0.2)

But this didn’t work.

I need a way to load this dataset from my python generator while still being able to split it into train and test. Is there any solution I’m missing?

I really appreciate any help you can provide. Thanks in advance!

Just an update:
After 170 minutes, the following code finished successfully:

dataset = Dataset.from_generator(s3_file_generator).train_test_split(test_size=0.2)

However, given the long run time, I suspect it downloaded all 14k dicts up front instead of yielding them one at a time, as the generator is supposed to…

So I still need to figure out how to optimize this code, or whether I am making a mistake when loading the dataset from my generator.

Hi! A Dataset stores the data in Arrow format, so it downloads everything. If you want your dataset to download progressively as you iterate over it, you can use IterableDataset.from_generator instead.

IterableDataset doesn’t implement train_test_split though, because you can’t split a generator in the middle. Instead, I’d recommend defining two generators: one for train and one for test.
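One way to do this, sketched here under the assumption that you can list the S3 object keys up front: shuffle and split the keys once, then feed each half to its own IterableDataset. The `split_keys` helper and the `keys` parameter on the generator are hypothetical names; `gen_kwargs` is the datasets-library mechanism for passing arguments to the generator.

```python
import random

def split_keys(keys, test_size=0.2, seed=0):
    """Shuffle the object keys once, then split them 80:20."""
    keys = list(keys)
    random.Random(seed).shuffle(keys)
    n_test = int(len(keys) * test_size)
    return keys[n_test:], keys[:n_test]  # (train_keys, test_keys)

# train_keys, test_keys = split_keys(all_keys)
# train_ds = IterableDataset.from_generator(s3_file_generator,
#                                           gen_kwargs={"keys": train_keys})
# test_ds = IterableDataset.from_generator(s3_file_generator,
#                                          gen_kwargs={"keys": test_keys})
```

Because the split happens on the keys rather than the yielded examples, each IterableDataset still streams its records lazily.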

Hello! Thank you for your response! This worked, but in the end training raised an error every epoch when it tried to call next() on the dataset, which wouldn’t work. It ended up being more efficient to download the dataset as a regular Dataset object.

Thank you very much for your time!