Imran1
November 2, 2023, 9:08am
1
I can load a dataset in streaming mode, but I am confused about how to prepare it for training so that the model is trained iteratively on the whole dataset.
If anyone can provide a notebook, that would be very helpful.
@lhoestq
What are you using for training?
If you have your own training loop, you can use a DataLoader with the streaming dataset.
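For illustration, here is a minimal sketch of that pattern (not from this thread): it streams ultrachat_200k, tokenizes lazily with `.map()`, and feeds batches through a DataLoader. The dataset, tokenizer, and column names are just the ones discussed in this thread, used as placeholders.
```
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# Dataset and model names taken from this thread, for illustration only
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(example):
    # With a streaming dataset, .map() is lazy: this runs on the fly while iterating
    return tokenizer(example["prompt"], truncation=True, max_length=512)

# Column names assumed from the ultrachat_200k schema
tokenized = dataset.map(tokenize, remove_columns=["prompt", "prompt_id", "messages"])

# DataCollatorWithPadding pads each batch to its longest sequence and returns tensors
dataloader = DataLoader(tokenized, batch_size=2, collate_fn=DataCollatorWithPadding(tokenizer))

for batch in dataloader:
    batch["input_ids"]  # feed this to your model's forward pass / optimizer step
    break               # drop this break to stream through the whole split
```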
Imran1
November 3, 2023, 1:35am
3
Here is the complete code, please check it.
Hi, I am trying to SFT (fine-tune) the Zephyr model with the ultrachat_200k dataset, but it shows CUDA out-of-memory issues.
How do I load the dataset with streaming and prepare it for training on each chunk?
Here is the code:
```
!pip install --upgrade "transformers" "datasets" "peft" "accelerate" "bitsandbytes" "safetensors" "trl" "wandb"
!pip install -U git+https://github.com/huggingface/trl.git@main
#!pip install -U flash-attn

!git config --global credential.helper store
!huggingface-cli login --token 'token' --add-to-git-credential

from datasets import load_dataset

dataset_base = load_dataset('HuggingFaceH4/ultrachat_200k', streaming=True)
dataset_base

def formatting_func(example):
    instruction = '### Instruction:\n Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n'
    input_prompt = f"### Prompt:\n{example['prompt']}\n\n"
    # Check if there's an instruction and include it
    if 'context' in example:
        input_prompt += f"### Instruction:\n{example['context']}\n\n"
    input_prompt += "### Conversation:\n"
    for message in example['messages']:
        input_prompt += f"{message['role']}: {message['content']}\n\n"
    text = instruction + input_prompt
    return {"text": text}

# Select the splits you want to format
splits_to_format = ['train_sft', 'test_sft']

# Apply the formatting function to the selected splits
for split_name in splits_to_format:
    dataset_base[split_name] = dataset_base[split_name].map(formatting_func)

# Now, your 'train_sft' and 'test_sft' splits have been formatted using 'formatting_func'.
# You can access them as follows:
train_sft = dataset_base['train_sft']
test_sft = dataset_base['test_sft']

# Streaming datasets are iterable and don't support integer indexing,
# so peek at a single formatted example instead of indexing train_sft[2]:
print(next(iter(train_sft))["text"])

import torch
import transformers
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"

qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    trust_remote_code=True,
    #use_flash_attention_2=True  # use flash attention v2
    #use_auth_token=True,
)
base_model

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

from trl import SFTTrainer

supervised_finetuning_trainer = SFTTrainer(
    base_model,
    train_dataset=train_sft,
    eval_dataset=test_sft,
    args=transformers.TrainingArguments(
        output_dir="sft_z",
        max_steps=500,
        logging_steps=10,
        save_steps=10000,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=8,
        gradient_checkpointing=False,
        group_by_length=False,
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_steps=100,
        weight_decay=0.05,
        optim="paged_adamw_8bit",
        fp16=True,
        remove_unused_columns=False,
        run_name="sft_zephyar",
        report_to="wandb",
    ),
    tokenizer=tokenizer,
    peft_config=qlora_config,
    dataset_text_field="text",
    max_seq_length=512,
    neftune_noise_alpha=5
)

supervised_finetuning_trainer.train()
```
lhoestq
November 3, 2023, 11:13am
4
Your issue doesn’t seem to be related to the dataset; feel free to continue the discussion in your GitHub issue.
Imran1
November 3, 2023, 11:41am
5
My question is: how do I iteratively train the model if the dataset is in streaming mode?
Can you provide any notebook? I just want to learn the concepts/tricks, etc.
lhoestq
November 3, 2023, 11:49am
6
You can find code examples on how to use a streaming dataset in your own training loop here: Stream
It’s generally a good starting point if you want to adapt it to your use case.
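Roughly, the pattern from those docs looks like this (a short sketch, assuming you are streaming ultrachat_200k as in this thread): `shuffle()` keeps a buffer for approximate shuffling, and `set_epoch()` gives each epoch a new example order.
```
from itertools import islice
from datasets import load_dataset

# Dataset name taken from this thread, for illustration only
streamed = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)
streamed = streamed.shuffle(seed=42, buffer_size=1_000)  # approximate shuffle via a buffer

for epoch in range(3):
    streamed.set_epoch(epoch)  # new effective seed each epoch -> new example order
    for example in islice(streamed, 4):  # islice only keeps this demo short
        print(epoch, example["prompt"][:60])
        # in a real loop: tokenize, forward pass, loss.backward(), optimizer.step()
```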
Imran1
November 3, 2023, 12:08pm
7
Thank you. I would like to know, can I use this with the Trainer API?
Actually, I want to train the model on the dataset using streaming mode, where the Trainer API automatically downloads chunks or batches, tokenizes them, and trains iteratively. By doing this I will save my RAM.
You can pass your chunk and tokenize function to your streaming dataset using .map(), and then pass the dataset to the Trainer. The chunking and tokenization will happen iteratively during training.
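To make that concrete, here is a minimal sketch with the plain Trainer (assumptions: the ultrachat_200k prompts are used as the training text, and the quantization/LoRA setup from the snippet above is omitted to keep the example focused):
```
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

streamed = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)

def tokenize(example):
    # Runs lazily on each example as it is streamed from the Hub
    return tokenizer(example["prompt"], truncation=True, max_length=512)

# Column names assumed from the ultrachat_200k schema
train_stream = streamed.map(tokenize, remove_columns=["prompt", "prompt_id", "messages"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="trainer_streaming_demo",   # placeholder output directory
        max_steps=500,                         # required: a streaming dataset has no __len__
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
    ),
    train_dataset=train_stream,
    # mlm=False makes the collator build causal-LM labels from input_ids
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
Because the tokenization is attached to the streaming dataset with .map(), only the examples needed for the current batches are downloaded and tokenized, which is what keeps disk and RAM usage low.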
Imran1
November 3, 2023, 2:05pm
9
Streaming=True does not support .map().
lhoestq
November 3, 2023, 2:18pm
10