For anyone else stumbling upon this thread, I've found that PyTorch implements the sort of caching I was talking about through the `num_workers` and `pin_memory` parameters of the DataLoader. For the HF trainer, the parameter is named `dataloader_num_workers`.
If `num_workers > 0`, one or more worker processes, independent from the main one, perform the data loading and preprocessing while the forward/backward passes are happening. Nothing is stored on disk though; it all stays in RAM.
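
In case it helps, here's a minimal sketch of where these parameters go. The dataset, batch size, and worker counts are just placeholder values I picked for the example:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset just so the example runs; replace with your own Dataset.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,     # 4 background worker processes prefetch/collate batches
    pin_memory=True,   # page-locked host memory speeds up host->GPU copies
)

# The equivalent knob on the HF Trainer side (values here are just examples):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    dataloader_num_workers=4,    # forwarded to the underlying DataLoader
    dataloader_pin_memory=True,  # pinning is enabled by default, if I recall correctly
)
```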