Streaming dataset and cache

Hello!

I was wondering about the inner workings of dataset streaming.

Is there any form of caching for the samples that will be needed soon, or are they downloaded and processed only when the model requires them for training?

In the latter case, we could expect a higher latency when going from one batch to another, especially when the samples require some heavy processing. In that case it’s probably better to process the dataset locally beforehand.

Cheers

hi @BrunoHays! Yes, currently it’s the latter case - with remote datasets loaded with streaming=True nothing is cached/downloaded on disk; samples are loaded into memory only at iteration time. If the dataset requires a lot of processing, it’s probably better to load it once, do all the necessary processing steps, and save the processed version to disk with Dataset.save_to_disk. Later you can load it with load_from_disk. If you want to convert a loaded Dataset object into a streamable IterableDataset (for example, if you want to use shuffling without slowing things down), you can use the .to_iterable_dataset() method.
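A minimal sketch of that workflow (the dataset name and the preprocessing function are placeholders for your own):

```python
from datasets import load_dataset, load_from_disk

def preprocess(example):
    # placeholder for your own (possibly heavy) processing
    return example

# One-time preparation: download, process, and save the dataset locally.
ds = load_dataset("user/my-dataset", split="train")  # hypothetical dataset name
ds = ds.map(preprocess)
ds.save_to_disk("processed_dataset")

# Later, in the training script: reload the processed data...
processed = load_from_disk("processed_dataset")

# ...and optionally turn it into a streamable IterableDataset,
# e.g. to shuffle with a buffer instead of full random access.
iterable = processed.to_iterable_dataset(num_shards=64)
iterable = iterable.shuffle(seed=42, buffer_size=1000)
```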


Very clear, thank you.

My issue is that the processed dataset is very large compared to the raw dataset.

Just in case, is this also true if we have loaded the raw dataset locally and then stream it with .to_iterable_dataset()?

In that case the processing is done on the fly, if I’m not mistaken. But is it still computed only at iteration time?

@BrunoHays yes, for iterable datasets streamed from local files it’s the same: if the processing is done on the fly, the processed examples are only available at iteration time; they are not cached anywhere. Feel free to open an issue in the Datasets repo if you think it would be useful to have an option to cache streaming datasets somehow 🙂
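To illustrate the lazy behaviour, here is a tiny self-contained example (toy data, just for demonstration):

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c"]})
iterable = ds.to_iterable_dataset()

# .map() on an IterableDataset is lazy: nothing is computed here...
iterable = iterable.map(lambda ex: {"text": ex["text"].upper()})

# ...the function only runs when the dataset is actually iterated.
for example in iterable:
    print(example)  # {'text': 'A'}, {'text': 'B'}, {'text': 'C'}
```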


For anyone else stumbling upon this thread, I’ve found that PyTorch implements the sort of caching I was talking about through the num_workers and pin_memory parameters of the DataLoader. For the HF Trainer, the parameter is named dataloader_num_workers.
If num_workers > 0, one or more worker processes, independent from the main one, perform the processing while the forward/backward passes are happening. Nothing is stored on disk, though; it relies on RAM.
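A rough sketch of what that looks like (the toy dataset here just stands in for a large streamed or processed one):

```python
from datasets import Dataset
from torch.utils.data import DataLoader

# Toy dataset standing in for a large streamed/processed one.
ds = Dataset.from_dict({"x": list(range(1000))})
iterable = ds.to_iterable_dataset(num_shards=4)  # shards let each worker take a slice

loader = DataLoader(
    iterable,
    batch_size=16,
    num_workers=4,    # background worker processes prepare upcoming batches
    pin_memory=True,  # page-locked host memory for faster CPU->GPU copies
)

for batch in loader:
    pass  # forward/backward would run here while the workers keep prefetching
```

With the HF Trainer, the equivalent knobs are dataloader_num_workers and dataloader_pin_memory in TrainingArguments.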

Since datasets are typically iterated in a predictable way (unless shuffled, but with a fixed seed this should be possible as well), is it possible to configure a lookahead range where the dataset is pre-downloaded and pre-processed, so that once ramped up there is little to no latency between batches?

For a TB-sized dataset, a lookahead of a few MB might be sufficient with a good internet connection, while keeping the GPU saturated.
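In case it helps, one way to approximate such a lookahead with what already exists is the DataLoader's prefetch_factor (only valid when num_workers > 0): each worker keeps that many batches downloaded and processed ahead of the training loop. A sketch, with a placeholder dataset name:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Streamed dataset; the name is just a placeholder.
iterable = load_dataset("user/huge-dataset", split="train", streaming=True)

loader = DataLoader(
    iterable,
    batch_size=16,
    num_workers=4,
    prefetch_factor=8,  # each worker keeps up to 8 batches ready ahead of time
    pin_memory=True,
)
```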