Hello all!
I am building a dataset with a Python generator that pulls data from AWS S3 buckets, using boto3 to fetch the images and annotations I need. The yielded dict looks something like this:
```python
{
    'bboxes': [],
    'image': Image.open(get_file(s3, 'bucket_name', img_file)).resize((1000, 1000)),
    'ner_tags': [],
    'tokens': [],
    'ids': []
}
```
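For completeness, here is roughly what my generator looks like. The real `get_file()` wraps a boto3 `get_object` call and the image goes through `Image.open(...).resize(...)`; both are stubbed out below just so the overall shape is clear:

```python
# Sketch of my generator. The real get_file() downloads from S3 via boto3;
# it is stubbed here so the structure of the yielded dicts is visible.
def get_file(bucket, key):
    # placeholder for: s3.get_object(Bucket=bucket, Key=key)["Body"]
    return f"<contents of s3://{bucket}/{key}>"

def s3_file_generator():
    # in the real code this list comes from listing the bucket (~14,000 keys)
    img_files = ["doc_001.png", "doc_002.png"]
    for img_file in img_files:
        yield {
            "bboxes": [],
            # really: Image.open(get_file(s3, 'bucket_name', img_file)).resize((1000, 1000))
            "image": get_file("bucket_name", img_file),
            "ner_tags": [],
            "tokens": [],
            "ids": [],
        }
```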
Since my AWS bucket holds around 14,000 documents, I figured a generator would be my best bet for loading this data with HF Datasets.
I am already aware of the `from_generator()` method on the `Dataset` class, and I want to load the dataset like this:

```python
dataset = Dataset.from_generator(s3_file_generator)
```
However, I also want to split the dataset into train and test sets (80:20 ratio), and, unlike the `load_dataset()` function, the `from_generator()` method does not accept a `split="train"` parameter. I have already tried this:

```python
dataset = Dataset.from_generator(s3_file_generator).train_test_split(test_size=0.2)
```
But this didn’t work.
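To be concrete about the behavior I'm after, here is a plain-Python sketch of the 80:20 shuffle-and-split I want `train_test_split` to perform over the generator's output (no `datasets` involved, just to show the target result):

```python
import random

def split_examples(examples, test_size=0.2, seed=42):
    """Shuffle a sequence of examples and split it into (train, test).

    This is just an illustration of the split I want; in practice I would
    rather have the datasets library do this on the generator-built Dataset.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    examples = list(examples)
    rng.shuffle(examples)
    n_test = int(len(examples) * test_size)
    return examples[n_test:], examples[:n_test]

train, test = split_examples(range(100), test_size=0.2)
# 80 train examples, 20 test examples
```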
I need a way to load this dataset from my python generator while still being able to split it into train and test. Is there any solution I’m missing?
I really appreciate any help you can provide. Thanks in advance!