What’s the definition of lazy loading? Is IterableDataset also faster than Dataset when loading locally?

What’s the definition of lazy loading? Do IterableDataset and Dataset determine whether lazy loading happens? My understanding is that lazy loading means we don’t load all the data at once, so lazy loading only happens when we use IterableDataset.
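To make the definition above concrete, here is a minimal sketch of eager versus lazy loading in plain Python. It only illustrates the general idea and is not how `datasets` implements either class; the helper names (`write_sample`, `load_eager`, `load_lazy`) and the sample file are hypothetical.

```python
import os
import tempfile


def write_sample(path, n_rows):
    """Create a small text file with one row per line (illustrative data only)."""
    with open(path, "w", encoding="utf-8") as f:
        for i in range(n_rows):
            f.write(f"row {i}\n")


def load_eager(path):
    """Eager: read every row into a list up front, so all rows sit in RAM at once."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_lazy(path):
    """Lazy: a generator that reads one row at a time, only when asked for it."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
write_sample(sample_path, 1000)

eager_rows = load_eager(sample_path)   # all 1000 rows are now in memory
lazy_rows = load_lazy(sample_path)     # nothing has been read yet

# Pull only the first three rows; the rest of the file is never parsed.
first_three = [next(lazy_rows) for _ in range(3)]
lazy_rows.close()  # stop early and close the underlying file

print(len(eager_rows), first_three)
```

In this sketch the lazy version cannot jump to row 500 without reading the rows before it, which is the usual trade-off of iterating instead of indexing.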

Another question comes up. Does IterableDataset use memory-mapping and zero-copy to retrieve data? Will IterableDataset and Dataset occupy the same amount of RAM when loading the same dataset? If we just retrieve data locally without shuffling, the speed difference between IterableDataset and Dataset is because contiguous sequential access is faster than random access, right?

Aside from definitions and general aspects, I think only the author or maintainer can really understand the implementation… @lhoestq

Thank you John! That link is very helpful!

I’m confused about this sentence: “But one caveat is that you must have the entire dataset stored on your disk or in memory, which blocks you from accessing datasets bigger than the disk.” Does “memory” refer to RAM? I can understand the case where the dataset is larger than the disk, but I think load_dataset can convert other file formats to .arrow, and it occupies little RAM, right?

I also noticed that a huge amount of virtual memory (around 100G, and my dataset is also around 100G) is occupied when I use load_from_disk or load_dataset without streaming to load .arrow files. Is that normal? I read the blog, and as I understand it, zero-copy does rely on virtual memory, and the size of the VM is related to the size of the dataset, right?
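As a rough illustration of why a memory-mapped file shows up as virtual memory of about the same size as the file, here is a small sketch using Python's standard `mmap` module. It assumes nothing about the internals of `datasets` or Arrow; the temporary file, the page count, and the variable names are made up for the example.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
n_pages = 256  # hypothetical file size: 256 pages

# Write a file where every byte of page i has the value i.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for i in range(n_pages):
        f.write(bytes([i]) * PAGE)

with open(path, "rb") as f:
    # Mapping reserves address space for the whole file (this is what counts
    # as virtual memory), but no file data is read into RAM at this point.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    mapped_size = len(mapped)

    # Touching a byte makes the OS bring in only the page that contains it.
    first_byte = mapped[0]
    last_byte = mapped[mapped_size - 1]

    # A memoryview slice refers to the mapped bytes without copying them.
    view = memoryview(mapped)
    chunk = view[PAGE:PAGE + 4]
    chunk_bytes = bytes(chunk)  # an explicit copy, made only for printing

    # Views must be released before the mapping can be closed.
    chunk.release()
    view.release()
    mapped.close()

os.remove(path)
print(mapped_size, first_byte, last_byte, chunk_bytes)
```

Under these assumptions, the mapped size tracks the file size, while resident RAM grows only with the pages that are actually touched, and the OS can evict those pages again under memory pressure.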

Thank you!

I’ve never worked with huge datasets…
