How to serialise a very large generator to disk

Hi there,

I have a very large generator that yields data that I would like to serialise as a Hugging Face dataset. I’ve decided to use the new Dataset.from_generator functionality. However, it looks to me that, despite passing keep_in_memory=False, my RAM keeps filling up and nothing is flushed to disk. At no point am I materialising my generator, so I don’t think the problem is in my code.

My features are defined as follows, and I make sure that the generator yields dictionaries with the following structure:

from datasets import Dataset, Features, Image, Sequence, Value

def dataset_features():
    return Features(
        dict(
            id=Value("string"),
            question=Value("string"),
            answer=Value("string"),
            frames=Sequence(Image()),
            metadata={"data_id": Value("string"), "question_type": Value("string")},
        )
    )
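
For reference, the examples I yield look roughly like this. The toy generator below is just an illustration of the structure (it is not my real data_generator, which builds examples from traj_files, loader, etc.) and uses PIL images for the frames:

from PIL import Image as PILImage

def toy_generator():
    # Illustrative only: yields dictionaries matching the features above
    for i in range(3):
        yield {
            "id": str(i),
            "question": "What is happening in these frames?",
            "answer": "A toy example.",
            "frames": [PILImage.new("RGB", (64, 64)) for _ in range(2)],
            "metadata": {"data_id": f"traj-{i}", "question_type": "toy"},
        }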

Then, I create my dataset using the generator as follows:

ds = Dataset.from_generator(
    data_generator,
    features=dataset_features(),
    gen_kwargs={
        "traj_files": traj_files,
        "loader": loader,
        "question_generators": question_generators,
        "args": args,
    },
)

ds.save_to_disk(final_data_path)

Considering that keep_in_memory=False is the default, I expected Hugging Face Datasets to flush the content to disk instead of keeping it in memory. Could you please help? Pinging @lhoestq, who might know the answer to this!

Hi! Are you sure the generator itself is not filling up the memory? You can check this by iterating over it directly and tracking memory usage with psutil. Also, the generated examples are only flushed to disk every 10000 iterations, which may explain the RAM usage. I’ve opened a PR to expose this number (writer_batch_size) in from_generator.
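
For example, something along these lines (a rough sketch using the generator and kwargs from your post; adjust the logging interval as you like):

import os
import psutil

process = psutil.Process(os.getpid())
for i, example in enumerate(
    data_generator(
        traj_files=traj_files,
        loader=loader,
        question_generators=question_generators,
        args=args,
    )
):
    # Print the resident set size of the current process every 1000 examples
    if i % 1000 == 0:
        rss_mb = process.memory_info().rss / 1024**2
        print(f"{i} examples consumed, RSS = {rss_mb:.0f} MiB")

If the RSS keeps growing here, the generator (or objects it keeps references to) is the culprit rather than Dataset.from_generator.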

Thank you. By that point I’ve probably already accumulated many data points, each of which can contain several images, so I guess it would make sense to reduce that number to a more suitable value.
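
Once your PR is available (assuming the keyword ends up being writer_batch_size, as in the PR), I would try something like:

ds = Dataset.from_generator(
    data_generator,
    features=dataset_features(),
    gen_kwargs={
        "traj_files": traj_files,
        "loader": loader,
        "question_generators": question_generators,
        "args": args,
    },
    writer_batch_size=100,  # flush to the Arrow cache every 100 examples instead of 10000
)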