Llama 2 fine-tuning with PEFT and QLoRA, and testing the model

Hi all, @philschmid, I hope you are doing well. For fine-tuning Llama 2, I created a CSV file with the Alpaca structure: a single "text" column containing ### Instruction, ### Input, and ### Response. I am confused about which PEFT + QLoRA method to use, since there are so many different code examples. Could you point me to a correct example for fine-tuning on Alpaca-formatted data, and for saving the model and running inference to test it? In some code I have seen, the tokenizer applies truncation and padding and the labels are set to -100; in other code no preprocessing is done at all. I appreciate your help. Many thanks.
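For concreteness, that preprocessing looked roughly like the sketch below (the prompt template and field names are my guesses, not a fixed standard):
//
# A sketch of the truncate + "-100 labels" preprocessing; prompt template
# and field names are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

def preprocess(example, max_length=512):
    prompt = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        "### Response:\n"
    )
    full = prompt + example["response"] + tokenizer.eos_token
    tokenized = tokenizer(full, truncation=True, max_length=max_length)
    # (padding can be left to the data collator at batch time)
    labels = list(tokenized["input_ids"])
    # Mask the prompt tokens with -100 so the loss is computed only
    # on the response tokens.
    prompt_len = min(len(tokenizer(prompt)["input_ids"]), len(labels))
    labels[:prompt_len] = [-100] * prompt_len
    tokenized["labels"] = labels
    return tokenized
//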


I suggest using the llama-recipes repo from Meta.

I'd recommend checking out the official example scripts:


@nielsr all the examples seem to load the model onto a single GPU before fine-tuning it. I found that with DeepSpeed only the training runs in parallel; the initial model still needs to fit on a single GPU first. Are you aware of any examples that show how to fine-tune a model that is loaded across GPUs? Thanks.
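To illustrate what I mean by "loaded across GPUs", a sketch (assuming accelerate and bitsandbytes are installed):
//
# Sketch: with `accelerate` installed, device_map="auto" shards the
# base weights across all visible GPUs at load time.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_8bit=True,          # needs bitsandbytes; optional
    device_map="auto",          # spread layers over the available GPUs
    torch_dtype=torch.float16,
)
print(model.hf_device_map)      # which layer ended up on which device
//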

@nielsr, many thanks for your help and tips. Sorry, I can't upload my data to Hugging Face; can I read it directly in the code from a local CSV file? It has one column named "text", which is a concatenation of ### Human: ... ### Assistant: .... Also, is there available code for inference?
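Something like this sketch is what I am hoping works for reading the CSV directly (the file name is a placeholder):
//
# Sketch: read training data straight from a local CSV with a single
# "text" column; no Hub upload needed.
from datasets import load_dataset

dataset = load_dataset("csv", data_files="my_train.csv", split="train")
print(dataset[0]["text"])  # "### Human: ... ### Assistant: ..."
//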

@aytugkaya, have you used multiple GPUs? If so, would you be able to share your code with me?

No, single GPU only.

Hello! I initialized a model and tokenizer to train on a T4 GPU using:
//
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16,
)
//
The configuration I used:
//
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

config = {
    "lora_config": lora_config,
    "learning_rate": 1e-4,
    "num_train_epochs": 1,
    "gradient_accumulation_steps": 4,
    "per_device_train_batch_size": 1,
}

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    fp16=True,  # use bf16 instead if the GPU supports it
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="no",
    optim="adamw_torch_fused",
    # `enable_profiler` and `total_steps` are defined earlier in the notebook
    max_steps=total_steps if enable_profiler else -1,
    # unpack everything except the LoRA config into TrainingArguments
    **{k: v for k, v in config.items() if k != "lora_config"},
)

model.save_pretrained("/content/drive/MyDrive/Colab Notebooks/llama2/saved_model")
//
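The training step that sits between those arguments and the save call would look roughly like this sketch (it assumes the model was wrapped with get_peft_model and that a tokenized train_dataset exists):
//
# Sketch of the training step; `model` wrapped with get_peft_model(...)
# and a tokenized `train_dataset` are assumptions.
from transformers import Trainer, default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=default_data_collator,
)
trainer.train()
//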
To load the trained model and use it, I did:
//
from transformers import LlamaForCausalLM
from peft import PeftModel

model = LlamaForCausalLM.from_pretrained(
    model_id,
    return_dict=True,
    load_in_8bit=True,
    device_map="auto",
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(
    model, peft_model, is_trainable=True, torch_dtype=torch.float16
)
//
Inference works, and I don't want to load a checkpoint.
When I try to run a new fine-tuning pass with the trained model, I get this error:
//
OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.75 GiB total capacity; 13.34 GiB already allocated; 128.81 MiB free; 13.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
//
How can I do it?
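One avenue is what the error message itself suggests, sketched below (whether it applies depends on the session state):
//
# Sketch: two things the error message hints at trying.
import gc
import torch

# 1) Drop references to the previously loaded model so its memory
#    can actually be freed before reloading for fine-tuning.
del model
gc.collect()
torch.cuda.empty_cache()   # return cached blocks to the allocator

# 2) Reduce fragmentation by capping the allocator's split size.
#    This is read when CUDA memory is first allocated, so set it at
#    process start, e.g. in the shell before launching Python:
#    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
//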

I've got it now!
The first time I fine-tuned, I did:
//
from peft import get_peft_model, prepare_model_for_int8_training

model = prepare_model_for_int8_training(model)
model = get_peft_model(model, peft_config)
//
When I loaded the model to continue training:
//
peft_model = "/content/drive/MyDrive/Colab Notebooks/llama2/saved_model2"
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(
    model_id,
    return_dict=True,
    load_in_8bit=True,
    device_map="auto",
    low_cpu_mem_usage=True,
)
model = prepare_model_for_int8_training(model)  # <<<<---- I forgot this part
model = PeftModel.from_pretrained(
    model, peft_model, is_trainable=True, torch_dtype=torch.float16
)
//

Hello @dametodata, are you using multiple GPUs or just one? I am after code that works across multiple specified GPUs.

Hello! I used one!

Hi, I'm following the sft.py example to fine-tune meta-llama/Llama-2-7b-chat-hf on this dataset: mlabonne/guanaco-llama2-1k · Datasets at Hugging Face. I turned on load_in_4bit and PEFT and fine-tuned the model for 30 epochs; the loss at the end reached about 0.05. However, when I load the saved model and run inference, I always get the same results as the vanilla model. I don't know what the issue is, since the training looks fine. Can someone shed some light on this? Thanks.
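For context, the load-for-inference pattern in question looks roughly like this sketch (the adapter path is a placeholder):
//
# Sketch: load the base model, then apply the saved adapter on top so
# the LoRA weights are actually used at inference time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-chat-hf"
adapter_path = "./results"   # placeholder for wherever sft.py saved it

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_path)  # applies the LoRA weights
model.eval()

prompt = "### Human: Hello! ### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
//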


Hi,

An update here: we now have scripts for both supervised fine-tuning (SFT) and DPO (direct preference optimization) in the alignment-handbook repository: alignment-handbook/scripts at main · huggingface/alignment-handbook · GitHub.

These scripts support multi-GPU fine-tuning using DeepSpeed.


Very nice, thanks!