I’m trying to create a dataset using .from_generator(). At the beginning, the first 1000 items are processed very quickly, but after 1000 the processing speed becomes extremely slow. I’m curious why this happens. I’d really appreciate it if anyone could help me with this problem.
Perhaps the dataset has become too large and exceeded your RAM capacity, so the system is swapping to an SSD or HDD as a substitute for RAM?
You might want to look for some know-how on creating large datasets.
Thanks for the reply. Here are some details.
I read data from HDF5 files and am trying to build an HF dataset for training purposes.
Each item of my dataset contains 3 images plus labels.
For instance, the pseudocode looks like this:

```python
import glob
import h5py

def mygen(path):
    # iterate over every HDF5 file under `path`
    for hdf5_path in sorted(glob.glob(f"{path}/*.h5")):
        with h5py.File(hdf5_path, "r") as f:
            # each item contains three images (labels omitted here for brevity)
            yield {"img1": f["img1"][...], "img2": f["img2"][...], "img3": f["img3"][...]}
```
The resolution of each image is 640x1080x3, and the dataset contains more than 40,000 items.
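I then build the dataset from this generator roughly like this (a minimal sketch; `mygen` as above, and `data/` is just a placeholder path):

```python
from datasets import Dataset

ds = Dataset.from_generator(mygen, gen_kwargs={"path": "data/"})
```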
I would appreciate it if you could provide some concrete examples or ideas for building a dataset like this.
Anyway, I think that using from_generator or from_list will use up too much RAM. If you want to create a very large dataset, you can define a loading script for it, but if you can limit the dataset to images, you might be able to use the following method.
In short, you can create a dataset just by uploading the images to HF in separate directories.
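For example, a minimal sketch of that approach, assuming a one-directory-per-label layout (the `imagefolder` builder is part of datasets; the paths and labels here are made up):

```python
from datasets import load_dataset

# Expected layout (one subdirectory per label):
#   my_images/
#       train/
#           cat/0001.png
#           dog/0002.png
ds = load_dataset("imagefolder", data_dir="my_images")
```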
Thanks for the reply. Is from_generator the most efficient way to create a large-scale dataset? I think the problem is RAM-related. I’m planning to create the dataset and save it locally.
I don’t have much experience with the datasets library, so please take this as a reference only.
from_generator is simple and convenient, but I don’t think it’s suitable for creating large datasets. Unless you add more RAM to your PC…
On the other hand, I think the approach of writing the data to disk and then loading it at the end, or writing a dedicated loading script, is better suited to building huge datasets.
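For instance, here is a rough sketch of the write-to-disk-in-shards idea, reusing a `mygen()` generator like the one above (the shard size and paths are arbitrary, and each item’s values are assumed to be Arrow-serializable, e.g. image bytes or nested lists rather than raw arrays):

```python
import pyarrow as pa
import pyarrow.parquet as pq

SHARD_SIZE = 500  # items per shard, chosen arbitrarily
buffer, shard_idx = [], 0
for item in mygen("data/"):
    buffer.append(item)
    if len(buffer) >= SHARD_SIZE:
        pq.write_table(pa.Table.from_pylist(buffer), f"shards/shard-{shard_idx:05d}.parquet")
        buffer, shard_idx = [], shard_idx + 1
if buffer:  # flush the last partial shard
    pq.write_table(pa.Table.from_pylist(buffer), f"shards/shard-{shard_idx:05d}.parquet")

# Then load at the end, without rebuilding anything in RAM:
from datasets import load_dataset
ds = load_dataset("parquet", data_files="shards/*.parquet")
```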
When reading the dataset, IterableDataset (streaming) should be available.
This saves RAM by reducing the amount of data loaded at once. I don’t know how to use it when creating a dataset, though.
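A minimal sketch of the streaming side, assuming the parquet shards from above:

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset that reads examples lazily
ds = load_dataset("parquet", data_files="shards/*.parquet", streaming=True)
for example in ds["train"]:
    ...  # process one example at a time without loading everything into RAM
```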