Strategy for generating a large dataset

Hi, I have to generate a dataset from 1,000+ large files by:

  1. choosing a file at random, with replacement, for each example (fast: this step takes ~1 min in total for all examples). We also need to keep, per file, the list of labels describing the categories the file belongs to.
  2. sampling each chosen file at a random location and extracting a numerical vector per example (slow: ~a few days in total)

Some constraints:

  • the data is proprietary, so the dataset cannot be uploaded online; it has to remain on a network share
  • the data files are large (~2 TB in total), so we don’t want to copy them unnecessarily
  • we need a copy of the examples for analysis and auditing purposes
  • we’d like to run step 1 (the choice of files) first, to analyse what kind of dataset will be generated (each file belongs to several categories, and these need to be balanced in the generated dataset)
  • each example has ~1,500 values between inputs and outputs
  • we’ll generate ~1 million examples
  • as the file sampling may take several days, it’d be good to make intermediate saves, so that if something breaks we don’t have to restart from the beginning

What would be a good strategy for this?

I’m thinking

  1. Generate a DataFrame with the random choices of files and their labels/categories and save it to disk. It’ll be quick, it lets us analyse what balance of categories we’ll get, and it gives us a hard copy (rough sketch of this step after the list).
  2. Write the logic to sample the files listed in that DataFrame, using DatasetBuilder.generate_examples(). I’m not sure about the best way to connect the two.
  3. Use Dataset.save_to_disk() to save the generated examples at periodic intervals. I’m not sure about the best way to do this either.
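
For step 1, I’m imagining something roughly like this (the file paths, labels and sampling weights below are made-up placeholders):

```python
import numpy as np
import pandas as pd

# Placeholders: in reality the file list, labels and sampling weights come from our inventory.
file_labels = {
    "/share/data/file_000.h5": ["cat_a", "cat_b"],
    "/share/data/file_001.h5": ["cat_b"],
}
file_probs = [0.7, 0.3]  # probability of choosing each file for a given example

rng = np.random.default_rng(0)
files = list(file_labels)
chosen = rng.choice(files, size=1_000_000, replace=True, p=file_probs)

choices = pd.DataFrame({
    "file": chosen,
    "labels": [file_labels[f] for f in chosen],
})
choices.to_parquet("choices.parquet")  # hard copy of step 1, input for step 2

# Check the balance of categories we'd get, before running the slow sampling step.
print(choices["labels"].explode().value_counts(normalize=True))
```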

I’d be grateful for any comments and recommendations!

datasets lets you load datasets that are bigger than memory, and you can also use a Python generator function to define a dataset. You can take a look at Dataset.from_generator.
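
For example, a toy generator just to show the shape of the API (by default the generated examples are written to the datasets cache on disk, so the resulting dataset can be bigger than RAM):

```python
from datasets import Dataset

def gen():
    for i in range(10):
        yield {"idx": i, "vector": [float(i)] * 4}

ds = Dataset.from_generator(gen)
print(ds[0])  # {'idx': 0, 'vector': [0.0, 0.0, 0.0, 0.0]}
```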

Your strategy doesn’t seem super efficient though, because for each example you’d have to load a new file into memory to sample from it.

You may consider doing this instead:

  • choose how many vectors you want to sample per file; in your case you should end up with 1M vectors in total. You can run your analysis on these counts before generating the actual dataset.
  • open each file one by one and sample the corresponding number of vectors into a dataset
  • then shuffle the full dataset

This way you only load each file into memory once; it should save you a lot of time.
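
A minimal sketch of that approach with Dataset.from_generator (the sample_vectors helper, file paths, per-file counts and labels below are placeholders):

```python
import numpy as np
from datasets import Dataset

def sample_vectors(path, n, rng):
    # Placeholder: in reality this reads `n` vectors of ~1,500 values
    # at random locations inside the file at `path`.
    return [rng.standard_normal(1500).tolist() for _ in range(n)]

def gen(files, counts, labels):
    rng = np.random.default_rng()
    for path in files:  # one pass per file: each file is opened only once
        for vector in sample_vectors(path, counts[path], rng):
            yield {"file": path, "labels": labels[path], "vector": vector}

# Placeholders: per-file sample counts and labels, decided during the analysis step.
files = ["/share/data/file_000.h5", "/share/data/file_001.h5"]
counts = {files[0]: 3, files[1]: 2}
labels = {files[0]: ["cat_a"], files[1]: ["cat_b"]}

ds = Dataset.from_generator(gen, gen_kwargs={"files": files, "counts": counts, "labels": labels})
ds = ds.shuffle(seed=42)  # shuffle at the end so examples from the same file aren't contiguous
```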


Hi @lhoestq,

Thanks for the pointer to Dataset.from_generator; that could be useful for my problem. Regarding your comments on the strategy, there are a couple more things to consider:

  • By design, different files will be sampled a different number of times (the probability of choosing each file for the next example differs from file to file).
  • With formats like HDF5, you don’t need to read the whole file, just the sample you need and a bit of data around it. So I don’t think grouping samples by file would be much more efficient: it would only save the time of re-opening the file, which seems negligible (see the sketch below).
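
To make that concrete, here’s a rough sketch of what I’m now planning: reading single rows from HDF5 with h5py inside a generator, processing the choices DataFrame from step 1 in shards, and saving each shard with Dataset.save_to_disk() so a crash doesn’t lose everything. The HDF5 layout (one 2-D dataset named "data" per file), paths and column names are all assumptions:

```python
import h5py
import numpy as np
import pandas as pd
from pathlib import Path
from datasets import Dataset, concatenate_datasets, load_from_disk

def sample_vector(path, rng):
    # Assumed layout: each HDF5 file holds one 2-D dataset called "data";
    # h5py only reads the requested row (and the chunks it touches) from disk.
    with h5py.File(path, "r") as f:
        data = f["data"]
        return data[rng.integers(0, data.shape[0])].tolist()

def gen_examples(rows):
    rng = np.random.default_rng()
    for row in rows:
        yield {"file": row["file"], "labels": list(row["labels"]),
               "vector": sample_vector(row["file"], rng)}

choices = pd.read_parquet("choices.parquet")  # output of step 1

n_shards = 100
for i in range(n_shards):
    out = Path(f"dataset_shards/shard_{i:03d}")
    if out.exists():  # on restart, skip shards that were already saved
        continue
    rows = choices.iloc[i::n_shards].to_dict("records")
    Dataset.from_generator(gen_examples, gen_kwargs={"rows": rows}).save_to_disk(out)

# Once all shards are done, reassemble the full dataset.
full = concatenate_datasets(
    [load_from_disk(f"dataset_shards/shard_{i:03d}") for i in range(n_shards)]
)
```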

@lhoestq I have a somewhat similar problem to deal with. I have PDF files that I load and process, and the data I extract from them needs to be pushed to the Hub after each file. Does from_generator work in this case, or what would you suggest?