Imran1
November 2, 2023, 9:08am
1
I can load a dataset in streaming mode, but I am confused about how to prepare it for training so that the model is trained iteratively on the whole dataset.
If anyone can provide a notebook, that would be very helpful.
@lhoestq
What are you using for training?
If you have your own training loop, you can use a DataLoader with the streaming dataset.
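For illustration, here is a minimal sketch of that pattern (not from this thread): it streams ultrachat_200k, tokenizes lazily with `.map()`, and feeds batches through a DataLoader. The dataset, tokenizer, and column names are just the ones discussed in this thread, used as placeholders.
```
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# Dataset and model names taken from this thread, for illustration only
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(example):
    # With a streaming dataset, .map() is lazy: this runs on the fly while iterating
    return tokenizer(example["prompt"], truncation=True, max_length=512)

# Column names assumed from the ultrachat_200k schema
tokenized = dataset.map(tokenize, remove_columns=["prompt", "prompt_id", "messages"])

# DataCollatorWithPadding pads each batch to its longest sequence and returns tensors
dataloader = DataLoader(tokenized, batch_size=2, collate_fn=DataCollatorWithPadding(tokenizer))

for batch in dataloader:
    batch["input_ids"]  # feed this to your model's forward pass / optimizer step
    break               # drop this break to stream through the whole split
```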
Imran1
November 3, 2023, 1:35am
3
Here is the complete code, please check it.
Hi, I am trying to SFT (fine-tune) the Zephyr model with the ultrachat_200k dataset, but it shows CUDA out-of-memory issues.
How do I load the dataset with streaming and prepare it for training on each chunk?
Here is the code:
```
!pip install --upgrade "transformers" "datasets" "peft" "accelerate" "bitsandbytes" "safetensors" "trl" "wandb"
!pip install -U git+https://github.com/huggingface/trl.git@main
#!pip install -U flash-attn

!git config --global credential.helper store
!huggingface-cli login --token 'token' --add-to-git-credential

from datasets import load_dataset

dataset_base = load_dataset('HuggingFaceH4/ultrachat_200k', streaming=True)
dataset_base

def formatting_func(example):
    instruction = '### Instruction:\n Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n'
    input_prompt = f"### Prompt:\n{example['prompt']}\n\n"
    # Check if there's an instruction and include it
    if 'context' in example:
        input_prompt += f"### Instruction:\n{example['context']}\n\n"
    input_prompt += "### Conversation:\n"
    for message in example['messages']:
        input_prompt += f"{message['role']}: {message['content']}\n\n"
    text = instruction + input_prompt
    return {"text": text}

# Select the splits you want to format
splits_to_format = ['train_sft', 'test_sft']

# Apply the formatting function to the selected splits
for split_name in splits_to_format:
    dataset_base[split_name] = dataset_base[split_name].map(formatting_func)

# Now, your 'train_sft' and 'test_sft' splits have been formatted using 'formatting_func'.
# You can access them as follows:
train_sft = dataset_base['train_sft']
test_sft = dataset_base['test_sft']

# Streaming datasets are iterable and don't support integer indexing,
# so peek at a single formatted example instead of indexing train_sft[2]:
print(next(iter(train_sft))["text"])

import torch
import transformers
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"

qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    trust_remote_code=True,
    #use_flash_attention_2=True  # use flash attention v2
    #use_auth_token=True,
)
base_model

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

from trl import SFTTrainer

supervised_finetuning_trainer = SFTTrainer(
    base_model,
    train_dataset=train_sft,
    eval_dataset=test_sft,
    args=transformers.TrainingArguments(
        output_dir="sft_z",
        max_steps=500,
        logging_steps=10,
        save_steps=10000,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=8,
        gradient_checkpointing=False,
        group_by_length=False,
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_steps=100,
        weight_decay=0.05,
        optim="paged_adamw_8bit",
        fp16=True,
        remove_unused_columns=False,
        run_name="sft_zephyar",
        report_to="wandb",
    ),
    tokenizer=tokenizer,
    peft_config=qlora_config,
    dataset_text_field="text",
    max_seq_length=512,
    neftune_noise_alpha=5
)

supervised_finetuning_trainer.train()
```
lhoestq
November 3, 2023, 11:13am
4
Your issue doesn’t seem to be related to the dataset; feel free to continue the discussion in your GitHub issue.
Imran1
November 3, 2023, 11:41am
5
My question is: how do I iteratively train the model if the dataset is in streaming mode?
Can you provide any notebook? I just want to learn the concepts/tricks, etc.
lhoestq
November 3, 2023, 11:49am
6
You can find code examples on how to use a streaming dataset in your own training loop here: Stream
It’s generally a good starting point if you want to adapt it to your use case.
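Roughly, the pattern from those docs looks like this (a short sketch, assuming you are streaming ultrachat_200k as in this thread): `shuffle()` keeps a buffer for approximate shuffling, and `set_epoch()` gives each epoch a new example order.
```
from itertools import islice
from datasets import load_dataset

# Dataset name taken from this thread, for illustration only
streamed = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)
streamed = streamed.shuffle(seed=42, buffer_size=1_000)  # approximate shuffle via a buffer

for epoch in range(3):
    streamed.set_epoch(epoch)  # new effective seed each epoch -> new example order
    for example in islice(streamed, 4):  # islice only keeps this demo short
        print(epoch, example["prompt"][:60])
        # in a real loop: tokenize, forward pass, loss.backward(), optimizer.step()
```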
Imran1
November 3, 2023, 12:08pm
7
Thank you. I would like to know, can I use this with the Trainer API?
Actually, I want to train the model on the dataset using streaming mode, where the Trainer API automatically downloads chunks or batches, tokenizes them, and trains iteratively. By doing this I will save my RAM.
You can pass your chunk and tokenize function to your streaming dataset using .map(), and then pass the dataset to the Trainer. The chunking and tokenization will happen iteratively during training.
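To make that concrete, here is a minimal sketch with the plain Trainer (assumptions: the ultrachat_200k prompts are used as the training text, and the quantization/LoRA setup from the snippet above is omitted to keep the example focused):
```
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

streamed = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)

def tokenize(example):
    # Runs lazily on each example as it is streamed from the Hub
    return tokenizer(example["prompt"], truncation=True, max_length=512)

# Column names assumed from the ultrachat_200k schema
train_stream = streamed.map(tokenize, remove_columns=["prompt", "prompt_id", "messages"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="trainer_streaming_demo",   # placeholder output directory
        max_steps=500,                         # required: a streaming dataset has no __len__
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
    ),
    train_dataset=train_stream,
    # mlm=False makes the collator build causal-LM labels from input_ids
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
Because the tokenization is attached to the streaming dataset with .map(), only the examples needed for the current batches are downloaded and tokenized, which is what keeps disk and RAM usage low.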
Imran1
November 3, 2023, 2:05pm
9
Streaming=True does not support .map().
lhoestq
November 3, 2023, 2:18pm
10