I have modified https://huggingface.co/blog/4bit-transformers-bitsandbytes to work with llama-2. It works, but I have some questions.
- When the training data is created, data.map() is called, which defaults to a batch_size of 1000. The code goes like this:
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
The dataset consists of ~2000 quotations. What does the 1000 refer to in this context: 1000 tokens, or 1000 quotations?
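My mental model (which I'd like confirmed) is that batch_size counts examples, i.e. quotations, handed to the mapped function per call. A pure-Python sketch of that understanding follows; batched_map and fake_tokenizer are my own stand-ins, not the datasets API:

```python
def batched_map(examples, fn, batch_size=1000):
    # Call fn on slices of up to batch_size examples,
    # mimicking how I believe datasets' map(batched=True) works.
    out = []
    for i in range(0, len(examples), batch_size):
        out.extend(fn(examples[i:i + batch_size]))
    return out

calls = []  # record the size of each batch the "tokenizer" sees

def fake_tokenizer(batch):
    calls.append(len(batch))
    return [q.split() for q in batch]  # stand-in for real tokenization

quotes = [f"quote number {i}" for i in range(2000)]  # ~2000 quotations, like the dataset
tokenized = batched_map(quotes, fake_tokenizer)
# If batch_size counts quotations, 2000 quotations give two calls of 1000 each
```

If this is right, the 1000 has nothing to do with tokens at all; it only controls how many rows the lambda receives at once.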
- When training with transformers.Trainer(), configuration is done via transformers.TrainingArguments(), which in turn accepts the argument per_device_train_batch_size (the default is 8, according to https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html). What is the relation between per_device_train_batch_size and the batch_size argument of map()?
-
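For context, my current understanding (which may be wrong) is that per_device_train_batch_size controls how many samples each GPU sees per training step, independent of map()'s batch_size, so the effective batch per optimizer step would be:

```python
# My own arithmetic for the effective training batch size; the variable
# names mirror TrainingArguments, but the formula is my assumption.
per_device_train_batch_size = 1   # value I use in the script below
num_devices = 6                   # my 6 x GTX 1060
gradient_accumulation_steps = 1
effective_batch_size = (per_device_train_batch_size
                        * num_devices
                        * gradient_accumulation_steps)
```

That would make the effective batch size 6 with my settings, while map()'s batch_size=1000 only affects preprocessing throughput. Is that correct?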
- I have 6 x GTX 1060 GPUs, and while training appears to succeed (at least checkpoints are saved and model.save() generates output), the loss reported during training looks weird, e.g.:
{'loss': 10.3735, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 11.8882, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 10.3735, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 11.6371, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 10.3735, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 10.3735, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 11.8719, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'train_runtime': 107.6853, 'train_samples_per_second': 0.186, 'train_steps_per_second': 0.186, 'train_loss': 3.844556379318237, 'epoch': 0.01}
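I wondered whether the 'learning_rate': 0.0 entries are just an artifact of the schedule. Here is my back-of-the-envelope sketch of a linear warmup/decay schedule with my settings (warmup_steps=2, max_steps=20, learning_rate=2e-4); lr_at is my own helper, not a transformers function:

```python
def lr_at(step, base_lr=2e-4, warmup_steps=2, max_steps=20):
    # Linear warmup to base_lr, then linear decay to 0 -- my understanding
    # of the default schedule. The logged value at step 0 would then be 0.0.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))
```

By that sketch the rate is 0.0 only at the very first step and climbs to 2e-4 by step 2, so a schedule alone would not explain 0.0 on every line; maybe something else is wrong, or the log rounds small values?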
The script I used is reproduced in full below:
#!/usr/bin/env python
# coding: utf-8
# Remember to run this document with a python3.11 kernel, since pip by default installs for python3.11 on my system.
# based on https://huggingface.co/blog/4bit-transformers-bitsandbytes
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
model_id = "TheBloke/Llama-2-7b-chat-fp16"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
# Then we have to apply some preprocessing to the model to prepare it for training. For that use the `prepare_model_for_kbit_training` method from PEFT.
from peft import prepare_model_for_kbit_training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
from peft import LoraConfig, get_peft_model
# This configuration is for llama-2, in particular the target_modules
config = LoraConfig(
    r=8,  # dimension of the updated matrices
    lora_alpha=32,  # parameter for scaling
    target_modules=["q_proj", "up_proj", "o_proj", "k_proj", "down_proj", "gate_proj", "v_proj"],
    lora_dropout=0.1,  # dropout probability for layers
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
print_trainable_parameters(model)
# Let's load a common dataset, english quotes, to fine tune our model on famous quotes.
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
# Run the cell below to run the training! For the sake of the demo, we just ran it for a few steps to showcase how to use this integration with existing tools in the HF ecosystem.
import transformers
# needed for gpt-neo-x tokenizer, but is it also needed for the llama tokenizer?
tokenizer.pad_token = tokenizer.eos_token
trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,  # number of forward steps before running a backward step
        warmup_steps=2,
        save_steps=10,
        max_steps=20,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
trainer.save_model("my_custom_LoRA_trained_model")