Finetuning quantised llama-2 with LoRA

I have modified https://huggingface.co/blog/4bit-transformers-bitsandbytes to work with llama-2. It works, but I have some questions.

  1. When the training data is created, the function data.map() is called, and it defaults to a batch_size of 1000. The code goes like this:
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

The dataset consists of ~2000 quotations.
What does 1000 refer to in this context? 1000 tokens, or 1000 quotations?
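
For reference, here is the same call with the batch size written out explicitly (1000 is just the default value I am asking about; this reuses data and tokenizer from the snippet above):

data = data.map(
    lambda samples: tokenizer(samples["quote"]),
    batched=True,
    batch_size=1000,  # the default value; is this counted in tokens or in quotations (rows)?
)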

  2. When training with transformers.Trainer(), configuration is done via transformers.TrainingArguments(), which in turn accepts the argument per_device_train_batch_size (the default is 8, according to https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html).
    What is the relation between per_device_train_batch_size and the batch_size argument of map()?
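
To make the comparison concrete, here is a minimal sketch of where each of the two settings lives, reusing data and tokenizer from the snippet above (the values are just the defaults mentioned so far, not a recommendation):

import transformers

# preprocessing: batch_size is an argument of datasets.Dataset.map()
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True, batch_size=1000)

# training: per_device_train_batch_size is an argument of transformers.TrainingArguments()
args = transformers.TrainingArguments(output_dir="outputs", per_device_train_batch_size=8)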

  3. I have 6 x GTX 1060 cards, and while training appears to be successful (at least checkpoints are saved and trainer.save_model() produces output), the loss reported during training looks weird, e.g.:

{'loss': 10.3735, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 11.8882, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 10.3735, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 11.6371, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 10.3735, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 10.3735, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 11.8719, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'train_runtime': 107.6853, 'train_samples_per_second': 0.186, 'train_steps_per_second': 0.186, 'train_loss': 3.844556379318237, 'epoch': 0.01}
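
In case the hardware matters for the zero losses above: the GTX 1060 is a Pascal card, so here is a quick sanity check (separate from the training script) of what the GPUs report. I am not sure whether the combination of bnb_4bit_compute_dtype=torch.bfloat16 and fp16=True in the script below is relevant on these cards, so that may be part of the same question.

import torch

for i in range(torch.cuda.device_count()):
    # a GTX 1060 reports compute capability (6, 1); bf16 needs Ampere (8, 0) or newer
    print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))

print("bf16 supported:", torch.cuda.is_bf16_supported())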

The script I used is reproduced in full below:

#!/usr/bin/env python
# coding: utf-8

# Remember to run this document with a python3.11 kernel, since pip by default installs for python3.11 on my system.
# based on https://huggingface.co/blog/4bit-transformers-bitsandbytes

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "TheBloke/Llama-2-7b-chat-fp16"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

# Then we have to apply some preprocessing to the model to prepare it for training. For that use the `prepare_model_for_kbit_training` method from PEFT.

from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

from peft import LoraConfig, get_peft_model

# This configuration is for llama-2, in particular the target_modules

config = LoraConfig(
    r=8,  # dimension of the updated matrices
    lora_alpha=32,  # parameter for scaling
    target_modules=["q_proj", "up_proj", "o_proj", "k_proj", "down_proj", "gate_proj", "v_proj"],
    lora_dropout=0.1,  # dropout probability for layers
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

# Let's load a common dataset, english quotes, to fine tune our model on famous quotes.

from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

# Run the cell below to run the training! For the sake of the demo, we just run it for a few steps to showcase how to use this integration with existing tools in the HF ecosystem.

import transformers

# needed for gpt-neo-x tokenizer, but is it also needed for the llama tokenizer?
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1, # number of forward steps before running a backward step
        warmup_steps=2,
        save_steps=10,
        max_steps=20,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

trainer.save_model("my_custom_LoRA_trained_model")
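
For completeness, here is a sketch of how the saved adapter could be loaded back for a quick test. It reuses model_id, bnb_config and tokenizer from the script above, and assumes that PeftModel.from_pretrained() can attach the adapter written by trainer.save_model() to a freshly quantised base model:

from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
model = PeftModel.from_pretrained(base, "my_custom_LoRA_trained_model")
model.config.use_cache = True  # re-enable the cache for inference, per the comment above

inputs = tokenizer("Imagination is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))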

Did the fine-tuned model work when you ran it? Did it actually improve performance on this dataset compared to the base model? How many steps did you train it for?