Hello all!
I am building a dataset with a Python generator that pulls data from AWS S3 buckets, using boto3 to fetch the images and annotations I need. The yielded dict looks something like this:
```python
{
    'bboxes': [],
    'image': Image.open(get_file(s3, 'bucket_name', img_file)).resize((1000, 1000)),
    'ner_tags': [],
    'tokens': [],
    'ids': []
}
```
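For completeness, here is roughly what my generator looks like. The real `get_file()` wraps a boto3 `get_object` call and the image goes through `Image.open(...).resize(...)`; both are stubbed out below just so the overall shape is clear:

```python
# Sketch of my generator. The real get_file() downloads from S3 via boto3;
# it is stubbed here so the structure of the yielded dicts is visible.
def get_file(bucket, key):
    # placeholder for: s3.get_object(Bucket=bucket, Key=key)["Body"]
    return f"<contents of s3://{bucket}/{key}>"

def s3_file_generator():
    # in the real code this list comes from listing the bucket (~14,000 keys)
    img_files = ["doc_001.png", "doc_002.png"]
    for img_file in img_files:
        yield {
            "bboxes": [],
            # really: Image.open(get_file(s3, 'bucket_name', img_file)).resize((1000, 1000))
            "image": get_file("bucket_name", img_file),
            "ner_tags": [],
            "tokens": [],
            "ids": [],
        }
```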
Since my AWS bucket holds around 14,000 documents, I figured a generator would be my best bet for loading this data with HF Datasets.
I am already aware of the `from_generator()` method on the `Dataset` class, and I want to load the dataset like this:

```python
dataset = Dataset.from_generator(s3_file_generator)
```
However, I also want to split the dataset into train and test sets (80:20 ratio), and, unlike the `load_dataset()` function, the `from_generator()` method does not accept a `split="train"` parameter. I have already tried this:

```python
dataset = Dataset.from_generator(s3_file_generator).train_test_split(test_size=0.2)
```
But this didn’t work.
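To be concrete about the behavior I'm after, here is a plain-Python sketch of the 80:20 shuffle-and-split I want `train_test_split` to perform over the generator's output (no `datasets` involved, just to show the target result):

```python
import random

def split_examples(examples, test_size=0.2, seed=42):
    """Shuffle a sequence of examples and split it into (train, test).

    This is just an illustration of the split I want; in practice I would
    rather have the datasets library do this on the generator-built Dataset.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    examples = list(examples)
    rng.shuffle(examples)
    n_test = int(len(examples) * test_size)
    return examples[n_test:], examples[:n_test]

train, test = split_examples(range(100), test_size=0.2)
# 80 train examples, 20 test examples
```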
I need a way to load this dataset from my python generator while still being able to split it into train and test. Is there any solution I’m missing?
I really appreciate any help you can provide. Thanks in advance!