How to serialise a very large generator to disk

Hi there,

I have a very large generator that yields data that I would like to serialise as a Hugging Face dataset. I’ve decided to use the new Dataset.from_generator functionality. However, it looks to me that, despite passing keep_in_memory=False, my RAM keeps filling up and nothing is flushed to disk. At no point am I materialising my generator, so I don’t think the problem is in my code.

My features are defined as follows, and I make sure that the generator yields dictionaries with the following structure:

from datasets import Dataset, Features, Image, Sequence, Value

def dataset_features():
    return Features(
        dict(
            id=Value("string"),
            question=Value("string"),
            answer=Value("string"),
            frames=Sequence(Image()),
            metadata={"data_id": Value("string"), "question_type": Value("string")},
        )
    )
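
For reference, the examples I yield look roughly like this. The toy generator below is just an illustration of the structure (it is not my real data_generator, which builds examples from traj_files, loader, etc.) and uses PIL images for the frames:

from PIL import Image as PILImage

def toy_generator():
    # Illustrative only: yields dictionaries matching the features above
    for i in range(3):
        yield {
            "id": str(i),
            "question": "What is happening in these frames?",
            "answer": "A toy example.",
            "frames": [PILImage.new("RGB", (64, 64)) for _ in range(2)],
            "metadata": {"data_id": f"traj-{i}", "question_type": "toy"},
        }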

Then, I create my dataset using the generator as follows:

ds = Dataset.from_generator(
    data_generator,
    features=dataset_features(),
    gen_kwargs={
        "traj_files": traj_files,
        "loader": loader,
        "question_generators": question_generators,
        "args": args,
    },
)

ds.save_to_disk(final_data_path)

Considering that keep_in_memory=False is the default, I expected Hugging Face Datasets to flush the content to disk instead of keeping it in memory. Could you please help? Pinging @lhoestq, who might know the answer to this!

Hi! Are you sure the generator itself is not filling up the memory? You can check this by iterating over it directly and tracking memory usage with psutil. Also, the generated examples are only flushed to disk every 10000 iterations, which may explain the RAM usage. I’ve opened a PR to expose this number (writer_batch_size) in from_generator.
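
For example, something along these lines (a rough sketch using the generator and kwargs from your post; adjust the logging interval as you like):

import os
import psutil

process = psutil.Process(os.getpid())
for i, example in enumerate(
    data_generator(
        traj_files=traj_files,
        loader=loader,
        question_generators=question_generators,
        args=args,
    )
):
    # Print the resident set size of the current process every 1000 examples
    if i % 1000 == 0:
        rss_mb = process.memory_info().rss / 1024**2
        print(f"{i} examples consumed, RSS = {rss_mb:.0f} MiB")

If the RSS keeps growing here, the generator (or objects it keeps references to) is the culprit rather than Dataset.from_generator.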

Thank you. By that point I’ve probably already accumulated many data points, each of which can contain several images, so I guess it would make sense to reduce that number to a more suitable value.
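
Once your PR is available (assuming the keyword ends up being writer_batch_size, as in the PR), I would try something like:

ds = Dataset.from_generator(
    data_generator,
    features=dataset_features(),
    gen_kwargs={
        "traj_files": traj_files,
        "loader": loader,
        "question_generators": question_generators,
        "args": args,
    },
    writer_batch_size=100,  # flush to the Arrow cache every 100 examples instead of 10000
)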