Hi,
I’m working on a dataset generation script that uses `from_generator()` to construct a dataset. It’s working fine for constructing a basic one, but what I would like to do is generate different subsets at the same time. The reason is that at each iteration of my data generation pipeline, multiple samples are generated, and each sample belongs to a specific subset of the dataset. Generating each subset separately would be very inefficient.
Current options I can think of to achieve this:
- Add a “subset” column to map samples to their corresponding subset, then at the end of the generation use it to filter the dataset and construct the individual subsets before saving to disk
- Instead of using `from_generator()`, use `add_item()` to iteratively construct the subsets as separate datasets
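To make the first option concrete, here is a rough sketch of what I have in mind. Plain Python lists stand in for the dataset so it runs without extra dependencies, and `generate_samples`, `split_by_subset`, the subset names, and the `text` field are all made-up placeholders, not my real pipeline.

```python
from collections import defaultdict


def generate_samples(num_iterations=3):
    """Hypothetical stand-in for the pipeline: every iteration yields
    several samples, each tagged with the subset it belongs to."""
    for i in range(num_iterations):
        yield {"subset": "train", "text": f"train sample {i}"}
        yield {"subset": "validation", "text": f"validation sample {i}"}


def split_by_subset(samples):
    """Consume the generator once and group rows by their "subset" tag."""
    buckets = defaultdict(list)
    for sample in samples:
        row = dict(sample)           # copy so the original sample is untouched
        subset = row.pop("subset")   # drop the helper column from the stored row
        buckets[subset].append(row)
    return dict(buckets)


buckets = split_by_subset(generate_samples())
for name, rows in buckets.items():
    print(name, len(rows))
```

With `datasets`, I assume the equivalent would be passing the generator to `Dataset.from_generator()`, then calling `filter()` on the "subset" column once per subset and `save_to_disk()` on each result, which means one extra pass over the full dataset for every subset.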
I think both approaches are sub-optimal, but I can’t think of anything better.
Any help is appreciated.
Thanks!