Strategy for generating a large dataset

Hi, I have to generate a dataset from 1,000+ large files by:

  1. choosing a file at random, with replacement, for each example (fast: this step takes ~1 min in total for all examples). We also need to keep, per file, the list of labels describing the categories the file belongs to.
  2. sampling each chosen file at a random location and extracting a numerical vector per example (slow: ~a few days in total)

Some constraints:

  • the data is proprietary, so the dataset cannot be uploaded online; it has to remain on a network share
  • the data files are large (~2 TB in total), so we don’t want to copy them unnecessarily
  • we need a copy of the examples for analysis and auditing purposes
  • we’d like to run step 1 (the choice of files) first, to analyse what kind of dataset will be generated (each file belongs to several categories, and these need to be balanced in the generated dataset)
  • each example has ~1,500 values between inputs and outputs
  • we’ll generate ~1 million examples
  • as the file sampling may take several days, it’d be good to make intermediate saves, so that if something breaks we don’t have to restart from the beginning

What would be a good strategy for this?

I’m thinking

  1. Generate a DataFrame with the random choices of files and their labels/categories and save it to disk. It’ll be quick, it lets us analyse what balance of categories we’ll get, and it gives us a hard copy (rough sketch of this step after the list).
  2. Write the logic to sample the files listed in that DataFrame, using DatasetBuilder.generate_examples(). I’m not sure about the best way to connect the two.
  3. Use Dataset.save_to_disk() to save the generated examples at periodic intervals. I’m not sure about the best way to do this either.
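
For step 1, I’m imagining something roughly like this (the file paths, labels and sampling weights below are made-up placeholders):

```python
import numpy as np
import pandas as pd

# Placeholders: in reality the file list, labels and sampling weights come from our inventory.
file_labels = {
    "/share/data/file_000.h5": ["cat_a", "cat_b"],
    "/share/data/file_001.h5": ["cat_b"],
}
file_probs = [0.7, 0.3]  # probability of choosing each file for a given example

rng = np.random.default_rng(0)
files = list(file_labels)
chosen = rng.choice(files, size=1_000_000, replace=True, p=file_probs)

choices = pd.DataFrame({
    "file": chosen,
    "labels": [file_labels[f] for f in chosen],
})
choices.to_parquet("choices.parquet")  # hard copy of step 1, input for step 2

# Check the balance of categories we'd get, before running the slow sampling step.
print(choices["labels"].explode().value_counts(normalize=True))
```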

I’d be grateful for any comments and recommendations!

datasets lets you load datasets that are bigger than memory, and you can also use a Python generator function to define a dataset. You can take a look at Dataset.from_generator.
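
For example, a toy generator just to show the shape of the API (by default the generated examples are written to the datasets cache on disk, so the resulting dataset can be bigger than RAM):

```python
from datasets import Dataset

def gen():
    for i in range(10):
        yield {"idx": i, "vector": [float(i)] * 4}

ds = Dataset.from_generator(gen)
print(ds[0])  # {'idx': 0, 'vector': [0.0, 0.0, 0.0, 0.0]}
```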

Your strategy doesn’t seem super efficient though, because for each example you’d have to load a new file into memory to sample from it.

You may consider doing this instead:

  • choose how many vectors you want to sample per file; in your case you should end up with 1M vectors in total. You can run your analysis on these counts before generating the actual dataset.
  • open each file one by one and sample the corresponding number of vectors into a dataset
  • then shuffle the full dataset

This way you only load each file into memory once; it should save you a lot of time.
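
A minimal sketch of that approach with Dataset.from_generator (the sample_vectors helper, file paths, per-file counts and labels below are placeholders):

```python
import numpy as np
from datasets import Dataset

def sample_vectors(path, n, rng):
    # Placeholder: in reality this reads `n` vectors of ~1,500 values
    # at random locations inside the file at `path`.
    return [rng.standard_normal(1500).tolist() for _ in range(n)]

def gen(files, counts, labels):
    rng = np.random.default_rng()
    for path in files:  # one pass per file: each file is opened only once
        for vector in sample_vectors(path, counts[path], rng):
            yield {"file": path, "labels": labels[path], "vector": vector}

# Placeholders: per-file sample counts and labels, decided during the analysis step.
files = ["/share/data/file_000.h5", "/share/data/file_001.h5"]
counts = {files[0]: 3, files[1]: 2}
labels = {files[0]: ["cat_a"], files[1]: ["cat_b"]}

ds = Dataset.from_generator(gen, gen_kwargs={"files": files, "counts": counts, "labels": labels})
ds = ds.shuffle(seed=42)  # shuffle at the end so examples from the same file aren't contiguous
```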


Hi @lhoestq,

Thanks for the pointer to Dataset.from_generator; that could be useful for my problem. Regarding your comments on the strategy, there are a couple more things to consider:

  • By design, different files will be sampled a different number of times (the probability of choosing each file for the next example differs from file to file).
  • With formats like HDF5, you don’t need to read the whole file, just the sample you need and a bit of data around it. So I don’t think grouping samples by file would be much more efficient: it would only save the time of re-opening the file, which seems negligible (see the sketch below).
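
To make that concrete, here’s a rough sketch of what I’m now planning: reading single rows from HDF5 with h5py inside a generator, processing the choices DataFrame from step 1 in shards, and saving each shard with Dataset.save_to_disk() so a crash doesn’t lose everything. The HDF5 layout (one 2-D dataset named "data" per file), paths and column names are all assumptions:

```python
import h5py
import numpy as np
import pandas as pd
from pathlib import Path
from datasets import Dataset, concatenate_datasets, load_from_disk

def sample_vector(path, rng):
    # Assumed layout: each HDF5 file holds one 2-D dataset called "data";
    # h5py only reads the requested row (and the chunks it touches) from disk.
    with h5py.File(path, "r") as f:
        data = f["data"]
        return data[rng.integers(0, data.shape[0])].tolist()

def gen_examples(rows):
    rng = np.random.default_rng()
    for row in rows:
        yield {"file": row["file"], "labels": list(row["labels"]),
               "vector": sample_vector(row["file"], rng)}

choices = pd.read_parquet("choices.parquet")  # output of step 1

n_shards = 100
for i in range(n_shards):
    out = Path(f"dataset_shards/shard_{i:03d}")
    if out.exists():  # on restart, skip shards that were already saved
        continue
    rows = choices.iloc[i::n_shards].to_dict("records")
    Dataset.from_generator(gen_examples, gen_kwargs={"rows": rows}).save_to_disk(out)

# Once all shards are done, reassemble the full dataset.
full = concatenate_datasets(
    [load_from_disk(f"dataset_shards/shard_{i:03d}") for i in range(n_shards)]
)
```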

@lhoestq I have a somewhat similar problem to deal with. I have PDF files that I load and process, and the data I extract from them needs to be pushed to the Hub after each file. Does from_generator work in this case, or what would you suggest?