Perhaps the dataset has grown too large to fit in RAM, and you are using an SSD or HDD as a substitute for RAM?
You might want to look for some know-how on creating large datasets.
I have my own library which processes corpora of documents much larger than what fits into memory.
I would like to create a new HF Dataset by incrementally adding/streaming new examples to it, but because of the number of examples, the dataset would not fit into memory.
I am not sure whether add_item should be used for this, as it does not seem to modify the dataset in place but rather returns a new dataset; it looks like the dataset is actually immutable and add_item would create a new data…
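For what it's worth, here is a minimal sketch of one way around this (assuming a reasonably recent `datasets` release where `Dataset.from_generator` is available): instead of growing a dataset with `add_item`, which returns a new `Dataset` each time, a generator can stream examples into an on-disk Arrow cache, so the full dataset never has to sit in RAM. The `example_generator` below is just a placeholder.

```python
# Minimal sketch: stream examples to disk instead of building them up in RAM.
from datasets import Dataset

def example_generator():
    # Placeholder for whatever produces your examples one at a time.
    for i in range(100_000):
        yield {"id": i, "text": f"example {i}"}

# from_generator writes the examples to an Arrow cache file on disk as they
# are yielded, so the resulting dataset is memory-mapped rather than held in RAM.
ds = Dataset.from_generator(example_generator)
ds.save_to_disk("my_dataset")  # persist as an Arrow dataset folder
```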
Hi, I have to generate a dataset from 1,000+ large files by:
choosing a file at random, with replacement, for each example (fast: this step takes ~1 min in total for all examples). We need to keep a list of labels per file describing the categories the file belongs to.
sampling each chosen file at a random location (slow: ~a few days) and extracting a numerical vector per example
Some constraints:
the data is proprietary and the dataset cannot be uploaded online; it has to rem…
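In case it is useful, here is a minimal sketch of how the two steps above could be wired into `Dataset.from_generator` so that examples are written to a local Arrow cache as they are produced and nothing leaves the machine. `FILES`, `LABELS_PER_FILE`, and `extract_vector` are hypothetical placeholders for the proprietary files, their category labels, and the slow sampling/feature-extraction step.

```python
import random
from datasets import Dataset

FILES = ["file_0001.bin", "file_0002.bin"]        # 1,000+ large files in practice
LABELS_PER_FILE = {                               # categories each file belongs to
    "file_0001.bin": ["cat_a"],
    "file_0002.bin": ["cat_a", "cat_b"],
}

def extract_vector(path, offset):
    # Placeholder for the slow step: sample the file at a random
    # location and turn the bytes into a numerical vector.
    return [0.0] * 128

def example_generator(num_examples=100_000):
    for _ in range(num_examples):
        path = random.choice(FILES)               # random choice with replacement
        offset = random.randrange(10**9)          # random location within the file
        yield {"vector": extract_vector(path, offset),
               "labels": LABELS_PER_FILE[path]}

# Examples are written incrementally to a local Arrow cache, so the dataset
# never has to fit into memory and stays on the local machine.
ds = Dataset.from_generator(example_generator, cache_dir="./hf_cache")
ds.save_to_disk("./my_local_dataset")
```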