I’m trying to create a dataset using .from_generator(). At the beginning, the first 1000 items are processed very quickly, but after 1000 the processing speed becomes extremely slow. I’m curious why this happens. I’d really appreciate it if anyone could help me with this problem.
Perhaps the dataset has become too large and exceeded your RAM capacity, so the system is swapping to an SSD or HDD as a substitute for RAM?
You might want to look for some know-how on creating large datasets.
Thanks for the reply. Here are some details.
I read data from HDF5 files and am trying to build an HF dataset for training purposes.
Each item of my dataset contains 3 images plus labels.
For instance, the pseudocode looks like this:

```python
import glob
import h5py

def mygen(path):
    # iterate over every HDF5 file under `path`
    for hdf5_path in sorted(glob.glob(f"{path}/*.h5")):
        with h5py.File(hdf5_path, "r") as f:
            # each item contains three images (labels omitted here for brevity)
            yield {"img1": f["img1"][...], "img2": f["img2"][...], "img3": f["img3"][...]}
```
The resolution of each image is 640x1080x3, and the dataset contains more than 40,000 items.
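I then build the dataset from this generator roughly like this (a minimal sketch; `mygen` as above, and `data/` is just a placeholder path):

```python
from datasets import Dataset

ds = Dataset.from_generator(mygen, gen_kwargs={"path": "data/"})
```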
I would appreciate it if you could provide some concrete examples or ideas for building a dataset like this.
Anyway, I think that using from_generator or from_list will use up too much RAM. If you want to create a very large dataset, you can define a loading script for it, but if you can limit the dataset to images, you might be able to use the following method.
In short, you can create a dataset just by uploading the images to HF in separate directories.
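For example, a minimal sketch of that approach, assuming a one-directory-per-label layout (the `imagefolder` builder is part of datasets; the paths and labels here are made up):

```python
from datasets import load_dataset

# Expected layout (one subdirectory per label):
#   my_images/
#       train/
#           cat/0001.png
#           dog/0002.png
ds = load_dataset("imagefolder", data_dir="my_images")
```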
Thanks for the reply. Is from_generator the most efficient way to create a large-scale dataset? I think the problem is RAM-related. I’m planning to create the dataset and save it locally.
I don’t have much experience with the datasets library, so please take this as a reference only.
from_generator is simple and convenient, but I don’t think it’s suitable for creating large datasets. Unless you add more RAM to your PC…
On the other hand, I think the approach of writing the data to disk and then loading it at the end, or writing a dedicated loading script, is better suited to building huge datasets.
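For instance, here is a rough sketch of the write-to-disk-in-shards idea, reusing a `mygen()` generator like the one above (the shard size and paths are arbitrary, and each item’s values are assumed to be Arrow-serializable, e.g. image bytes or nested lists rather than raw arrays):

```python
import pyarrow as pa
import pyarrow.parquet as pq

SHARD_SIZE = 500  # items per shard, chosen arbitrarily
buffer, shard_idx = [], 0
for item in mygen("data/"):
    buffer.append(item)
    if len(buffer) >= SHARD_SIZE:
        pq.write_table(pa.Table.from_pylist(buffer), f"shards/shard-{shard_idx:05d}.parquet")
        buffer, shard_idx = [], shard_idx + 1
if buffer:  # flush the last partial shard
    pq.write_table(pa.Table.from_pylist(buffer), f"shards/shard-{shard_idx:05d}.parquet")

# Then load at the end, without rebuilding anything in RAM:
from datasets import load_dataset
ds = load_dataset("parquet", data_files="shards/*.parquet")
```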
When reading the dataset, IterableDataset (streaming) should be available.
This saves RAM by reducing the amount of data loaded at once. I don’t know how to use it when creating a dataset, though.
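A minimal sketch of the streaming side, assuming the parquet shards from above:

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset that reads examples lazily
ds = load_dataset("parquet", data_files="shards/*.parquet", streaming=True)
for example in ds["train"]:
    ...  # process one example at a time without loading everything into RAM
```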