Hi all, @philschmid, I hope you are doing well. For fine-tuning Llama 2, I created a CSV file with the Alpaca structure, which has a text column containing ### Instruction, ### Input, and ### Response. I am confused about which PEFT/QLoRA method to use for fine-tuning, since there are many different code examples around. Could you please refer me to code that is right for fine-tuning with the Alpaca structure, and for saving the model and running inference to test it? In some code I saw that the tokenizer truncates and pads and sets the labels to -100, while in other code no preprocessing is done at all. I appreciate your help. Many thanks.
I suggest using the llama-recipes repo from Meta.
I'd recommend checking out the official example scripts:
- TRL: https://github.com/huggingface/trl/blob/main/examples/scripts/sft_trainer.py. This script illustrates how to do supervised fine-tuning (SFT) on a dataset of (human instruction, completion) pairs. It can optionally leverage LoRA instead of full fine-tuning if you pass `--use_peft` to the script. See this post for an example of how to call the script. Do note that it will train your model on both instructions and completions by default (as it uses `DataCollatorForLanguageModeling` by default). You can alternatively use the `DataCollatorForCompletionOnlyLM` class to train on completions only (see the sketch after this list). If you pass both `--load_in_4bit` and `--use_peft`, then you're doing QLoRA (quantized LoRA, i.e. LoRA on a frozen quantized LLM).
- PEFT itself also has example scripts and notebooks: https://github.com/huggingface/peft/tree/main/examples/causal_language_modeling.
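To make the completion-only option concrete, here is a minimal sketch based on the pattern in the TRL documentation, adapted to the Alpaca-style format (the `tatsu-lab/alpaca` dataset and its column names are assumptions). It also shows where the -100 labels mentioned in the question come from: the collator masks everything before the response template, so the loss is computed on the completions only.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Alpaca-style dataset with "instruction", "input", "output" columns (an assumption)
dataset = load_dataset("tatsu-lab/alpaca", split="train")

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example["instruction"])):
        text = (
            f"### Instruction:\n{example['instruction'][i]}\n\n"
            f"### Input:\n{example['input'][i]}\n\n"
            f"### Response:\n{example['output'][i]}"
        )
        output_texts.append(text)
    return output_texts

# Tokens before the response template get label -100,
# so the loss is only computed on the response tokens
response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)
trainer.train()
```

Note: with the Llama tokenizer, matching a plain-string response template can be sensitive to leading whitespace; if the collator warns that it cannot find the template, the TRL docs show how to pass pre-tokenized template ids instead.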
@nielsr all the examples seem to load the model onto a single GPU before fine-tuning it. I found that with DeepSpeed only the training is done in parallel, but the initial model first needs to fit on a single GPU. Are you aware of any examples that show how to fine-tune a model that is loaded across GPUs? Thanks.
@nielsr, many thanks for your help and tips. Sorry, I can't upload my data to Hugging Face; can I read it directly in the code from a local CSV file, with one column named "text" that is a concatenation of ### Human: … ### Assistant:? Is there available code for inference?
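(For reference, a local CSV can be read directly with the `datasets` library; a minimal sketch, assuming a file named `data.csv` with a `text` column:)

```python
from datasets import load_dataset

# Read a local CSV file directly; "data.csv" and the "text" column are assumptions
dataset = load_dataset("csv", data_files="data.csv", split="train")
print(dataset[0]["text"])
```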
@aytugkaya, have you used multiple GPUs? Is it possible to share your code with me if it uses multiple GPUs?
No, single GPU only.
Hello! I initialized a model and tokenizer to train on a T4 GPU using:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_8bit=True, device_map="auto", torch_dtype=torch.float16
)
```
The configuration I used:
```python
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

config = {
    "lora_config": lora_config,
    "learning_rate": 1e-4,
    "num_train_epochs": 1,
    "gradient_accumulation_steps": 4,
    "per_device_train_batch_size": 1,
}

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    fp16=True,  # use bf16 instead if your GPU supports it
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="no",
    optim="adamw_torch_fused",
    max_steps=total_steps if enable_profiler else -1,  # total_steps / enable_profiler defined elsewhere
    **{k: v for k, v in config.items() if k != "lora_config"},
)

model.save_pretrained("/content/drive/MyDrive/Colab Notebooks/llama2/saved_model")
```
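(The Trainer creation and `trainer.train()` call are omitted above, between the TrainingArguments and `save_pretrained`; roughly, that step would look like the following sketch, where `train_dataset` is an assumption:)

```python
from transformers import Trainer

# Hypothetical wiring of the omitted training step; `train_dataset` is an assumption
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```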
To load the trained model and use it, I did:
```python
from peft import PeftModel
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    model_id,
    return_dict=True,
    load_in_8bit=True,
    device_map="auto",
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(
    model, peft_model, is_trainable=True, torch_dtype=torch.float16
)
```
The inference is working, and I don't want to load a checkpoint.
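(For context, the inference call is a standard generate loop; a minimal sketch, where the prompt text is an assumption:)

```python
# Minimal inference sketch; the prompt text is an assumption
prompt = "### Human: What is PEFT?\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```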
When I try to do new fine-tuning with the trained model, I encounter this error:
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.75 GiB total capacity; 13.34 GiB already allocated; 128.81 MiB free; 13.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
How can I fix this?
I've got it now!
The first time I fine-tuned, I did:
```python
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, peft_config)
```
When I loaded the model to continue training:
//
peft_model=ā/content/drive/MyDrive/Colab Notebooks/llama2/saved_model2ā
model_id=āmeta-llama/Llama-2-7b-hfā
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(
model_id,
return_dict=True,
load_in_8bit=True,
device_map=āautoā,
low_cpu_mem_usage=True,
)
model = prepare_model_for_int8_training(model) # <<<<---- I forgot this part
model = PeftModel.from_pretrained(model, peft_model,is_trainable=True, torch_dtype=torch.float16)
//
Hello @dametodata, are you using multiple GPUs or one? I am after code that works with multiple specified GPUs.
Hello! I used one!
Hi, I'm following the sft.py example to fine-tune meta-llama/Llama-2-7b-chat-hf with this dataset: mlabonne/guanaco-llama2-1k · Datasets at Hugging Face. I turned on load_in_4bit and PEFT and fine-tuned the model for 30 epochs. The loss at the end reached about 0.05. However, when I load this saved model and do inference, I always get the same results as the vanilla one. I don't know what the issue is, because the training looks alright. Can someone shed some light on this? Thanks.
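(One common cause of this symptom, assuming the training itself went fine, is that the LoRA adapter never gets attached at inference time, so generation falls back to the vanilla base model; loading the adapter on top of the base model looks roughly like this sketch, where the adapter path is an assumption:)

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-7b-chat-hf"
adapter_path = "./results/final_checkpoint"  # hypothetical path to the saved LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
# Without this step, generation uses the vanilla base weights
model = PeftModel.from_pretrained(model, adapter_path)
```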
Hi,
An update here is that we now have scripts for both supervised fine-tuning (SFT) and DPO (direct preference optimization) in the Alignment Handbook repository: alignment-handbook/scripts at main · huggingface/alignment-handbook · GitHub.
These scripts support multi-GPU fine-tuning using DeepSpeed.
Very nice, thanks!