Hi, I have to generate a dataset from 1,000+ large files by:
- making a random choice, with replacement, of one file per example (fast: this step takes ~1 min in total for all examples). We also need to keep, for each file, the list of labels describing the categories it belongs to (see the sketch after this list).
- sampling each chosen file at a random location (slow: a few days in total) and extracting a numerical vector per example
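For step 1, something like the sketch below is what I have in mind. The glob pattern and `get_labels()` are placeholders for however we actually list the files and look up their categories.

```python
# Minimal sketch of step 1: choose one file per example, with replacement,
# attach the per-file category labels, and keep a hard copy on disk.
from glob import glob

import numpy as np
import pandas as pd

N_EXAMPLES = 1_000_000
file_paths = sorted(glob("/mnt/share/data/**/*.dat", recursive=True))  # placeholder pattern

rng = np.random.default_rng(seed=0)  # fixed seed so the choice is reproducible
chosen = rng.choice(file_paths, size=N_EXAMPLES, replace=True)

choices_df = pd.DataFrame({
    "example_id": np.arange(N_EXAMPLES),
    "file": chosen,
})

# get_labels(path) is a placeholder returning the list of categories of a file.
labels = {path: get_labels(path) for path in file_paths}
choices_df["labels"] = choices_df["file"].map(labels)

# Hard copy for the analysis/auditing step.
choices_df.to_parquet("choices.parquet")

# Quick check of the category balance we'd get.
print(choices_df.explode("labels")["labels"].value_counts())
```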
Some constraints:
- the data is proprietary; the dataset cannot be uploaded online and has to remain on a network share
- the data files are large (~2 TB in total), so we don’t want to copy them unnecessarily
- we need a copy of the examples for analysis and auditing purposes
- we’d like to run step 1 (the choice of files) first, to analyse what kind of dataset we’ll get (each file belongs to several categories that need to be balanced in the generated dataset)
- counting both inputs and outputs, each example has ~1,500 values
- we’ll generate ~1 million examples
- As the file sampling may take several days, it’d be good to make intermediate saves, so if something breaks, we don’t need to restart from the beginning
What would be a good strategy for this?
I’m thinking:
- Generate a DataFrame with the random choices of files and their labels/categories and save it to disk. It’ll be quick, it lets us analyse what balance of categories we’ll get, and it gives us a hard copy.
- Write the logic to sample the files listed in that DataFrame, using DatasetBuilder.generate_examples(). I’m not sure about the best way to connect the two (rough sketch below).
- Use Dataset.save_to_disk() to save the generated examples at periodic intervals. I’m not sure about the best way to do this either.
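For the last two points, I’m not sure I need a full DatasetBuilder at all. Something like the sketch below, using Dataset.from_generator() on one shard of the DataFrame at a time and calling save_to_disk() after each shard, is roughly what I have in mind; sample_vector() is a placeholder for the slow routine that samples a file at a random location and returns the ~1,500 values.

```python
# Rough sketch of steps 2-3: generate the dataset shard by shard so that a
# crash only costs the shard that was in progress.
import os

import pandas as pd
from datasets import Dataset, concatenate_datasets, load_from_disk

SHARD_SIZE = 10_000          # ~100 shards for 1M examples
OUT_DIR = "dataset_shards"   # somewhere on the network share

choices_df = pd.read_parquet("choices.parquet")

def gen_shard(shard_df):
    """Yield one example per row of the shard (this is the slow part)."""
    for row in shard_df.itertuples():
        yield {
            "example_id": row.example_id,
            "file": row.file,
            "labels": list(row.labels),
            "values": sample_vector(row.file),  # placeholder for the slow sampling
        }

shard_paths = []
for start in range(0, len(choices_df), SHARD_SIZE):
    shard_path = os.path.join(OUT_DIR, f"shard_{start:08d}")
    shard_paths.append(shard_path)
    if os.path.isdir(shard_path):
        continue  # shard already written in a previous run: skip it on restart
    shard_df = choices_df.iloc[start:start + SHARD_SIZE]
    shard_ds = Dataset.from_generator(gen_shard, gen_kwargs={"shard_df": shard_df})
    shard_ds.save_to_disk(shard_path)

# Once every shard exists, reassemble and save the full dataset.
full = concatenate_datasets([load_from_disk(p) for p in shard_paths])
full.save_to_disk(os.path.join(OUT_DIR, "full_dataset"))
```

One thing I’m aware of: if a run dies while a shard is being written, that half-written directory would have to be deleted before resuming, and the final concatenation could be skipped if keeping the shards separate is good enough for the analysis. Is this a sensible way to do the periodic saves, or is there a better pattern?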
I’d be grateful for any comments and recommendations!