Streaming dataset and cache

Hello!

I was wondering about the inner workings of dataset streaming.

Is there any form of caching of the samples that will be needed soon, or are they downloaded and processed only when the model requires them for training?

In the latter case, we could expect higher latency when going from one batch to another, especially when the samples require heavy processing. In that case, it’s probably better to process the dataset locally beforehand.

Cheers

hi @BrunoHays! Yes, currently it’s the latter case: with remote datasets loaded with streaming=True, nothing is cached or downloaded to disk; samples are loaded in memory only at the time of iteration. If the dataset requires a lot of processing, it’s probably better to load it once, do all the necessary processing steps, and save the processed version to disk with Dataset.save_to_disk. Later you can load it with load_from_disk. If you want to convert a loaded Dataset object to a streamable IterableDataset (for example, if you want to use shuffling without slowing iteration down), you can use the .to_iterable_dataset() method.
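A minimal sketch of that workflow, assuming a hypothetical dataset name `my_dataset` and a placeholder `preprocess` function standing in for your heavy processing:

```python
from datasets import load_dataset, load_from_disk

# Load once (non-streaming), so the raw data is downloaded to the local cache.
ds = load_dataset("my_dataset", split="train")  # "my_dataset" is a placeholder

def preprocess(example):
    # Stand-in for your heavy processing logic.
    return example

# Run the heavy processing a single time and persist the result.
ds = ds.map(preprocess)
ds.save_to_disk("processed_dataset")

# Later: reload the processed version instantly, with no reprocessing.
ds = load_from_disk("processed_dataset")

# Optionally convert to a streamable IterableDataset,
# e.g. to get fast approximate shuffling over shards.
iterable_ds = ds.to_iterable_dataset(num_shards=64)
iterable_ds = iterable_ds.shuffle(seed=42, buffer_size=1000)
```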


Very clear, thank you.

My issue is that the processed dataset is very large compared to the raw dataset.

Just in case, is this also true if we have loaded the raw dataset locally, and then stream it with .to_iterable_dataset()?

In that case the processing is done on the fly, if I’m not mistaken. But is it still computed only at the time of iteration?

@BrunoHays yes, for iterable datasets streamed from local files it’s the same: if the processing is done on the fly, the processed examples are available only at the time of iteration; they are not cached anywhere. Feel free to open an issue in the Datasets repo if you think it would be useful to have an option to cache streaming datasets somehow 🙂
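To illustrate the lazy behavior, here is a small sketch, again with a placeholder dataset name and a hypothetical `heavy_transform` function. Whether the IterableDataset comes from streaming=True or from .to_iterable_dataset(), .map() only registers the transform; the work happens as you iterate:

```python
from datasets import load_dataset

# Streaming mode: nothing is downloaded or processed up front.
ids = load_dataset("my_dataset", split="train", streaming=True)  # placeholder name

def heavy_transform(example):
    # Expensive work happens here, once per example, on every pass.
    return example

ids = ids.map(heavy_transform)  # lazy: registers the transform, runs nothing yet

for example in ids:  # heavy_transform runs now, sample by sample
    ...
```

One practical consequence: since nothing is cached, the transform re-runs on every epoch, which is exactly why pre-processing once with save_to_disk pays off for heavy pipelines.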
