How do I load a large dataset in streaming mode and prepare it for training?

I can load the dataset in streaming mode, but I am confused about how to prepare it for training so I can iteratively train the model on the whole dataset.

If anyone can provide a notebook, that would be very helpful.
@lhoestq

What are you using for training?

If you have your own training loop, you can use a DataLoader with the streaming dataset.

Here is the complete code; please check it.

Your issue doesn’t seem to be related to the dataset; feel free to continue the discussion in your GitHub issue.

My question is: how do I iteratively train the model if the dataset is in streaming mode?

Can you provide a notebook? I just want to learn the concepts, tricks, etc.

You can find code examples on how to use a streaming dataset in your own training loop here: Stream

It’s generally a good starting point if you want to adapt it to your use case :slight_smile:

Thank you. I would like to know: can I use this with the Trainer API?
Actually, I want to train the model on the dataset in streaming mode, where the Trainer API automatically downloads chunks or batches, tokenizes them, trains, and so on iteratively. That way I will save RAM.

You can apply your chunking and tokenization function to your streaming dataset using .map(), and then pass the dataset to the Trainer. The chunking and tokenization will happen iteratively during training.

streaming=True does not support .map().

Actually it does! See https://huggingface.co/docs/datasets/v2.14.5/en/stream#map

Thank you, :heartpulse: