Llama 2 fine-tuning with PEFT and QLoRA, and testing the model

Hi all, @philschmid, I hope you are doing well. For fine-tuning Llama 2, I created a CSV file with the Alpaca structure: a single "text" column containing ### Instruction, ### Input, and ### Response. I am confused about which PEFT + QLoRA method to use, since there are so many different code examples. Could you point me to a correct example for fine-tuning on Alpaca-formatted data, and for saving the model and running inference to test it? In some code I have seen, the tokenizer applies truncation and padding and the labels are set to -100; in other code no preprocessing is done at all. I appreciate your help. Many thanks.
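For concreteness, that preprocessing looked roughly like the sketch below (the prompt template and field names are my guesses, not a fixed standard):
//
# A sketch of the truncate + "-100 labels" preprocessing; prompt template
# and field names are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

def preprocess(example, max_length=512):
    prompt = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        "### Response:\n"
    )
    full = prompt + example["response"] + tokenizer.eos_token
    tokenized = tokenizer(full, truncation=True, max_length=max_length)
    # (padding can be left to the data collator at batch time)
    labels = list(tokenized["input_ids"])
    # Mask the prompt tokens with -100 so the loss is computed only
    # on the response tokens.
    prompt_len = min(len(tokenizer(prompt)["input_ids"]), len(labels))
    labels[:prompt_len] = [-100] * prompt_len
    tokenized["labels"] = labels
    return tokenized
//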


I suggest using the llama-recipes repo from Meta.

I'd recommend checking out the official example scripts:


@nielsr all the examples seem to load the model onto a single GPU before fine-tuning it. I found that with DeepSpeed only the training runs in parallel; the initial model still needs to fit on a single GPU first. Are you aware of any examples that show how to fine-tune a model that is loaded across GPUs? Thanks.
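To illustrate what I mean by "loaded across GPUs", a sketch (assuming accelerate and bitsandbytes are installed):
//
# Sketch: with `accelerate` installed, device_map="auto" shards the
# base weights across all visible GPUs at load time.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_8bit=True,          # needs bitsandbytes; optional
    device_map="auto",          # spread layers over the available GPUs
    torch_dtype=torch.float16,
)
print(model.hf_device_map)      # which layer ended up on which device
//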

@nielsr, many thanks for your help and tips. Sorry, I can't upload my data to Hugging Face; can I read it directly in the code from a local CSV file? It has one column named "text", which is a concatenation of ### Human: ... ### Assistant: .... Also, is there available code for inference?
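Something like this sketch is what I am hoping works for reading the CSV directly (the file name is a placeholder):
//
# Sketch: read training data straight from a local CSV with a single
# "text" column; no Hub upload needed.
from datasets import load_dataset

dataset = load_dataset("csv", data_files="my_train.csv", split="train")
print(dataset[0]["text"])  # "### Human: ... ### Assistant: ..."
//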

@aytugkaya, have you used multiple GPUs? If so, would you be able to share your code with me?

No, single GPU only.

Hello! I initialized a model and tokenizer to train on a T4 GPU using:
//
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16,
)
//
The configuration I used:
//
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

config = {
    "lora_config": lora_config,
    "learning_rate": 1e-4,
    "num_train_epochs": 1,
    "gradient_accumulation_steps": 4,
    "per_device_train_batch_size": 1,
}

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    fp16=True,  # use bf16 instead if the GPU supports it
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="no",
    optim="adamw_torch_fused",
    # `enable_profiler` and `total_steps` are defined earlier in the notebook
    max_steps=total_steps if enable_profiler else -1,
    # unpack everything except the LoRA config into TrainingArguments
    **{k: v for k, v in config.items() if k != "lora_config"},
)

model.save_pretrained("/content/drive/MyDrive/Colab Notebooks/llama2/saved_model")
//
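The training step that sits between those arguments and the save call would look roughly like this sketch (it assumes the model was wrapped with get_peft_model and that a tokenized train_dataset exists):
//
# Sketch of the training step; `model` wrapped with get_peft_model(...)
# and a tokenized `train_dataset` are assumptions.
from transformers import Trainer, default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=default_data_collator,
)
trainer.train()
//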
To load the trained model and use it, I did:
//
from transformers import LlamaForCausalLM
from peft import PeftModel

model = LlamaForCausalLM.from_pretrained(
    model_id,
    return_dict=True,
    load_in_8bit=True,
    device_map="auto",
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(
    model, peft_model, is_trainable=True, torch_dtype=torch.float16
)
//
Inference works, and I don't want to load a checkpoint.
When I try to run a new fine-tuning pass with the trained model, I get this error:
//
OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.75 GiB total capacity; 13.34 GiB already allocated; 128.81 MiB free; 13.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
//
How can I do it?
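One avenue is what the error message itself suggests, sketched below (whether it applies depends on the session state):
//
# Sketch: two things the error message hints at trying.
import gc
import torch

# 1) Drop references to the previously loaded model so its memory
#    can actually be freed before reloading for fine-tuning.
del model
gc.collect()
torch.cuda.empty_cache()   # return cached blocks to the allocator

# 2) Reduce fragmentation by capping the allocator's split size.
#    This is read when CUDA memory is first allocated, so set it at
#    process start, e.g. in the shell before launching Python:
#    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
//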

I've got it now!
The first time I fine-tuned, I did:
//
from peft import get_peft_model, prepare_model_for_int8_training

model = prepare_model_for_int8_training(model)
model = get_peft_model(model, peft_config)
//
When I loaded the model to continue training:
//
peft_model = "/content/drive/MyDrive/Colab Notebooks/llama2/saved_model2"
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(
    model_id,
    return_dict=True,
    load_in_8bit=True,
    device_map="auto",
    low_cpu_mem_usage=True,
)
model = prepare_model_for_int8_training(model)  # <<<<---- I forgot this part
model = PeftModel.from_pretrained(
    model, peft_model, is_trainable=True, torch_dtype=torch.float16
)
//

Hello @dametodata, are you using multiple GPUs or just one? I am after code that works across multiple specified GPUs.

Hello! I used one!

Hi, I'm following the sft.py example to fine-tune meta-llama/Llama-2-7b-chat-hf on this dataset: mlabonne/guanaco-llama2-1k · Datasets at Hugging Face. I turned on load_in_4bit and PEFT and fine-tuned the model for 30 epochs; the loss at the end reached about 0.05. However, when I load the saved model and run inference, I always get the same results as the vanilla model. I don't know what the issue is, since the training looks fine. Can someone shed some light on this? Thanks.
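For context, the load-for-inference pattern in question looks roughly like this sketch (the adapter path is a placeholder):
//
# Sketch: load the base model, then apply the saved adapter on top so
# the LoRA weights are actually used at inference time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-chat-hf"
adapter_path = "./results"   # placeholder for wherever sft.py saved it

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_path)  # applies the LoRA weights
model.eval()

prompt = "### Human: Hello! ### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
//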


Hi,

An update here: we now have scripts for both supervised fine-tuning (SFT) and DPO (direct preference optimization) in the alignment-handbook repository: alignment-handbook/scripts at main · huggingface/alignment-handbook · GitHub.

These scripts support multi-GPU fine-tuning using DeepSpeed.


Very nice, thanks!