Hi there,
I have a very large generator that returns data that I would like to serialise as a Hugging Face dataset. I've decided to use the new Dataset.from_generator functionality. However, despite the flag keep_in_memory=False, my RAM keeps filling up and nothing is flushed to disk. At no point am I materialising the generator, so I don't think this is a problem with my code.
My features are defined as follows, and I make sure that the generator yields dictionaries with this exact structure:
from datasets import Dataset, Features, Image, Sequence, Value

def dataset_features():
    return Features(
        dict(
            id=Value("string"),
            question=Value("string"),
            answer=Value("string"),
            frames=Sequence(Image()),
            metadata={"data_id": Value("string"), "question_type": Value("string")},
        )
    )
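For context, the generator yields one dictionary per example and never accumulates them in a list. A simplified sketch of what mine does (the attribute and helper names here are illustrative, not my exact code):

def data_generator(traj_files, loader, question_generators, args):
    # Lazily yield one dict per example, matching dataset_features().
    for traj_file in traj_files:
        trajectory = loader(traj_file)  # my own loading logic
        for question_generator in question_generators:
            question, answer = question_generator(trajectory, args)
            yield {
                "id": trajectory.id,
                "question": question,
                "answer": answer,
                "frames": trajectory.frames,  # list of PIL images
                "metadata": {
                    "data_id": trajectory.id,
                    "question_type": question_generator.name,
                },
            }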
Then, I create my dataset using the generator as follows:
ds = Dataset.from_generator(
    data_generator,
    features=dataset_features(),
    gen_kwargs={
        "traj_files": traj_files,
        "loader": loader,
        "question_generators": question_generators,
        "args": args,
    },
)
ds.save_to_disk(final_data_path)
Since keep_in_memory=False is the default, I was expecting the library to write the examples to the on-disk Arrow cache instead of holding them in memory. Could you please help? Pinging @lhoestq, who might know the answer to this!
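For reference, here is a minimal self-contained script that follows the same pattern, in case it helps to reproduce the issue independently of my loading code (it uses small synthetic PIL frames, so it is a sketch rather than my actual pipeline):

import numpy as np
from PIL import Image as PILImage
from datasets import Dataset, Features, Image, Sequence, Value

def synthetic_generator(num_examples):
    # Lazily yield examples, each with a few small random frames.
    for i in range(num_examples):
        frames = [
            PILImage.fromarray(np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8))
            for _ in range(8)
        ]
        yield {
            "id": str(i),
            "question": "q",
            "answer": "a",
            "frames": frames,
            "metadata": {"data_id": str(i), "question_type": "synthetic"},
        }

features = Features(
    dict(
        id=Value("string"),
        question=Value("string"),
        answer=Value("string"),
        frames=Sequence(Image()),
        metadata={"data_id": Value("string"), "question_type": Value("string")},
    )
)

ds = Dataset.from_generator(
    synthetic_generator,
    features=features,
    keep_in_memory=False,  # the default, stated explicitly here
    gen_kwargs={"num_examples": 50_000},
)
ds.save_to_disk("/tmp/synthetic_ds")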