How does Dataset.from_generator store data bigger than RAM?

from datasets import Dataset

ids = Dataset.from_generator(gen)  # gen is a generator function yielding example dicts
ids.save_to_disk("ds")

I want to run code like the one above, but save each shard to disk as soon as it is filled instead of keeping all of the generated data in RAM. Is there a way to manually flush a shard every n iterations?
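
A minimal sketch of what I mean, assuming the generator is consumed in fixed-size chunks and each chunk is saved as its own shard (write_shards, the shard size, and the output layout are my own names, not part of the datasets API):

import itertools
from datasets import Dataset

def write_shards(gen, shard_size=10_000, out_dir="ds"):
    # Consume the generator in chunks of shard_size examples, so only
    # one shard's worth of data is held in RAM at a time.
    it = gen()
    for shard_idx in itertools.count():
        chunk = list(itertools.islice(it, shard_size))
        if not chunk:
            break
        # Build a small in-memory Dataset from this chunk and flush it to disk.
        Dataset.from_list(chunk).save_to_disk(f"{out_dir}/shard_{shard_idx:05d}")

The shards could later be reloaded with load_from_disk and stitched back together with concatenate_datasets.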

In the petabyte case the data won’t even fit on a single disk, so the generator and the flush step should read from and write to a cloud bucket.
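
Since save_to_disk accepts fsspec URIs, each shard could presumably be flushed straight to object storage by replacing the save_to_disk call in the sketch above with something like the following (bucket name and credentials are placeholders, and the matching fsspec backend such as s3fs must be installed):

# Placeholder credentials for the target bucket.
storage_options = {"key": "...", "secret": "..."}
Dataset.from_list(chunk).save_to_disk(
    f"s3://my-bucket/ds/shard_{shard_idx:05d}",
    storage_options=storage_options,
)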


It seems like it could be done using the writer_batch_size parameter, but I’m not sure how to use it specifically…

By default, we write data to disk (so it can be memory-mapped) every 1000 rows/samples. You can control this with the writer_batch_size parameter.
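
For example, writer_batch_size can be passed through from_generator, assuming a reasonably recent version of datasets that forwards it to the underlying Arrow writer; 100 here is just an illustrative value:

from datasets import Dataset

# Flush to the on-disk Arrow cache every 100 examples instead of the
# default 1000, so at most ~100 rows are buffered in memory at a time.
ids = Dataset.from_generator(gen, writer_batch_size=100)
ids.save_to_disk("ds")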