Fine-tuning a Llama2-based model with LoftQ quantization

I am attempting to fine-tune a Llama2-based model for sequence classification with LoftQ quantization, but an error occurs when training starts.
How can I get the model training to run successfully?

[Model] "elyza/ELYZA-japanese-Llama-2-7b"
[Error]

RuntimeError                              Traceback (most recent call last)
<ipython-input-14-b0ee58b0a570> in <cell line: 31>()
     29 )
     30 
---> 31 trainer.train()
---
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py in forward(self, input)
    112 
    113     def forward(self, input: Tensor) -> Tensor:
--> 114         return F.linear(input, self.weight, self.bias)
    115 
    116     def extra_repr(self) -> str:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (22560x4096 and 1x1)

The reproduction code is as follows:

# Install following libraries
# pip install torch transformers datasets bitsandbytes accelerate peft

import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training, LoftQConfig
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load dataset
dataset = load_dataset("yelp_review_full", split="train[:1%]")
dataset = dataset.train_test_split(test_size=0.2)
dataset = dataset.rename_column("label", "labels")

model_name = "elyza/ELYZA-japanese-Llama-2-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"

def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True, max_length=4096, return_tensors='pt')

tokenized_datasets = dataset.map(tokenize_function, batched=True)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

loftq_config = LoftQConfig(loftq_bits=4)
lora_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    r=16,
    lora_alpha=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS
)

# LoftQ quantization
accelerator = Accelerator()
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5, quantization_config=bnb_config, device_map=device)
model = get_peft_model(model, lora_config)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted', zero_division=0)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

print(model)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    save_total_limit=1,
    dataloader_pin_memory=False,
    evaluation_strategy="steps",
    logging_steps=50,
    logging_dir='./logs'
)

data_collator=DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

trainer.train()   # Error occurs here


I believe that you have quantized your base model via model = AutoModelForSequenceClassification.from_pretrained(..., quantization_config=bnb_config, ...) before initializing LoftQ. Based on the LoftQ documentation, using LoftQConfig alone should be sufficient for quantization.

Another possible reason is a mismatch in model architecture. I'm not sure about your pretrained model, but LoftQ currently doesn't work for GPT2 because it expects the layers it targets to be nn.Linear, whereas GPT2 uses Conv1D layers.
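If you want to rule out the architecture issue, here is a quick diagnostic sketch (my own addition, not from the LoftQ docs) that prints the types of the modules you plan to target. It reuses the model variable loaded in the question's code; for Llama-2 these should all report Linear (or a bitsandbytes Linear4bit wrapper), whereas GPT2's attention projections would report Conv1D.

# Suffixes of the modules targeted by the LoRA config in the question,
# plus the classification head added by AutoModelForSequenceClassification.
target_suffixes = ("q_proj", "k_proj", "v_proj", "o_proj", "score")

for name, module in model.named_modules():
    if name.endswith(target_suffixes):
        # Llama-2: Linear / Linear4bit; GPT2 would show Conv1D here instead.
        print(f"{name}: {type(module).__name__}")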

In that sense, maybe you can remove LoftQ and just quantize the base model directly?

EDIT: According to the original LoftQ repo, it seems that they initialize a bnb_config but not a loftq_config. There seems to be a discrepancy in the documentation, since this is a relatively new PEFT technique.

I have a similar issue to the one described by Ossan but I am struggling to understand your response - any further clarity would be greatly appreciated!

I'm very sorry, but I'm not exactly familiar with the LoftQ technique. I was trying to nitpick the part of the code where the model is directly quantized to 4-bit via bnb_config, even though that is already done by LoftQConfig under the hood. As mentioned in the documentation:

from peft import LoftQConfig, LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(...)  # don't quantize here
loftq_config = LoftQConfig(loftq_bits=4, ...)           # set 4bit quantization
lora_config = LoraConfig(..., init_lora_weights="loftq", loftq_config=loftq_config)
peft_model = get_peft_model(base_model, lora_config)

However, when using @ossan03's code without the bnb_config, CUDA will most likely run out of memory if you are using the free version of Google Colab, since the quantization is no longer applied to the base model. The pretrained base model can be initialized in 16-bit for the inference data type, but I still ran into the same memory issues when initializing the LoRA config via get_peft_model.

You can try fine-tuning with this instead. I still don't get why we cannot directly quantize the base model, though. (Note: LoftQ has to be applied to the full-precision pre-trained weights before we can initialize the adapters and fine-tune the model.)
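For clarity, here is a minimal sketch of that workflow, assuming you have enough memory to hold the full-precision model: apply the LoftQ initialization to the unquantized base and save the result so the expensive initialization only has to run once. The bf16 dtype and the output paths are my own illustrative choices, not something prescribed by the LoftQ docs.

import torch
from transformers import AutoModelForSequenceClassification
from peft import LoftQConfig, LoraConfig, TaskType, get_peft_model

model_name = "elyza/ELYZA-japanese-Llama-2-7b"

# Load the full-precision (here bf16) backbone -- NOT quantized with bitsandbytes.
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=5,
    torch_dtype=torch.bfloat16,
)

loftq_config = LoftQConfig(loftq_bits=4)
lora_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    r=16,
    lora_alpha=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type=TaskType.SEQ_CLS,
)

# The LoftQ initialization happens inside get_peft_model; it needs the full model in memory.
peft_model = get_peft_model(base_model, lora_config)

# Save both the adjusted backbone and the LoftQ-initialized adapter so they can be
# reloaded later instead of recomputing the initialization every run.
peft_model.get_base_model().save_pretrained("./llama2-seq-cls-loftq-backbone")
peft_model.save_pretrained("./llama2-seq-cls-loftq-adapter")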

The above has nothing to do with the RuntimeError though. I reproduced the error using @ossan03’s code and changed the following to make it work (note that I did not split the dataset into batches):

  • Transpose the input_ids and attention_mask matrices in the dataset. If you are using batched=True, ensure that the batch_size you pass to dataset.map() matches the one in TrainingArguments(), and map each entry to data[key].view(data[key].shape[0], -1, 1). I also added dataset.with_format("torch"), since tensors are always converted back to lists when dataset.map() is executed. In my case, per_device_train_batch_size = 1 (a TrainingArguments() parameter).
def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True, max_length=4096, return_tensors='pt')

tokenized_datasets = dataset.map(tokenize_function, remove_columns=dataset["train"].column_names).with_format("torch")
tokenized_datasets = tokenized_datasets.map(lambda data: {key: data[key].view(-1, 1) for key in data}).with_format("torch")
  • For TrainingArguments(), set group_by_length=True. You should also set the learning_rate, lr_scheduler_type and optim; a sketch of these settings is shown below.
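To make the second bullet concrete, here is a minimal TrainingArguments sketch. The specific learning rate, scheduler and optimizer are illustrative choices of mine, not values verified against this exact setup.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,   # matches the batch size used when reshaping the tensors
    per_device_eval_batch_size=1,
    group_by_length=True,            # group samples of similar length to reduce padding
    learning_rate=2e-4,              # illustrative value
    lr_scheduler_type="cosine",      # illustrative choice
    optim="paged_adamw_8bit",        # illustrative choice
    evaluation_strategy="steps",
    logging_steps=50,
    logging_dir="./logs",
)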

I have not tested on batch_size > 1, but it should work similarly.

Thank you for your detailed response - unfortunately I am still unable to run the code on my end due to the matrix multiplication error, even after following your advice. Note that I have Colab Pro, so I am no longer too concerned about the bnb_config etc.; I just feel my input_ids and attention_mask are for some reason still not formatted correctly.

Can you share your full code that ran successfully so I can see exact dimensions and format of the test and train set? Thanks

@Tbritten99 yeah, my bad… I couldn't find what else I edited because I was in a rush at the time and have since deleted my code. I attempted to reproduce what I did previously, but to no avail - after changing the dimensions of input_ids and attention_mask I get one of the following errors: mat1 and mat2 shapes cannot be multiplied; too many values to unpack (expected 3) (custom 4d attention_mask as transformers .forward() argument · Issue #27493 · huggingface/transformers · GitHub); and Incorrect 4D attention_mask shape: (1, 1, 198, 1); expected: (1, 1, 1, 1) (the values in the incorrect shape depend on the first dataset entry the trainer visits). It should also be noted that I accidentally removed the labels column while mapping the transposed input_ids and attention_mask in the code snippet I posted previously, so do put the labels back in.

In all honesty, the easiest method is to use the LoftQ adapters already published for models such as Mistral, BART and Llama2; the next best alternative is to apply the LoftQ adapter to the pretrained model before it is quantized and save it under the loftq_init folder. Unfortunately, I can only recommend these solutions, as I'm working with a Colab Free account plus a personal GPU with relatively limited VRAM.
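For reference, this is roughly what the first route looks like, assuming one of the checkpoints published under the LoftQ organization on the Hub. The model id below (LoftQ/Llama-2-7b-hf-4bit-64rank) and the loftq_init subfolder name follow the naming used by the LoftQ repo at the time of writing; double-check them before relying on this sketch.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/Llama-2-7b-hf-4bit-64rank"  # assumed published LoftQ checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The backbone in these repos is already LoftQ-adjusted, so quantizing it with
# bitsandbytes here is expected, unlike in the LoftQConfig workflow above.
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoftQ-initialized LoRA adapter stored under the loftq_init subfolder.
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="loftq_init",
    is_trainable=True,
)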

Whatever is causing this matrix multiplication error seems to come down to the fact that, for decoder-only models like LLaMA, the input ends up with input_length = 1 here.

Sorry for the late reply.

I tried many things based on @DenseLance’s advice, but training with LoftQ quantization did not work.

So I gave up on LoftQ, quantized the base model with bitsandbytes, and used regular LoRA instead; that trained successfully.

The code that works is shown below, although it uses LoRA instead of LoftQ.

Thank you very much.

# Install following libraries
# pip install torch transformers datasets bitsandbytes accelerate peft

import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding
from transformers import LlamaTokenizer, set_seed

device = "cuda:0" if torch.cuda.is_available() else "cpu"

seed = 1000
set_seed(seed)

# Load dataset
dataset = load_dataset("yelp_review_full", split="train[:1%]")
dataset = dataset.train_test_split(test_size=0.2, seed=seed)
dataset = dataset.rename_column("label", "labels")

model_name = "elyza/ELYZA-japanese-Llama-2-7b"
# model_name = "mistralai/Mistral-7B-v0.1"

if any(k in model_name for k in ("gpt", "opt", "bloom")):
    padding_side = "left"
else:
    padding_side = "right"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side=padding_side)
# tokenizer = LlamaTokenizer.from_pretrained(model_name)
if getattr(tokenizer, "pad_token_id") is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True, max_length=4096, return_tensors='pt')

tokenized_datasets = dataset.map(tokenize_function, batched=True, batch_size=1)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS
)

# 4-bit quantization + LoRA
accelerator = Accelerator()
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5, quantization_config=bnb_config, device_map=device)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted', zero_division=0)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

print(model)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    save_total_limit=1,
    dataloader_pin_memory=False,
    evaluation_strategy="epoch",
    logging_dir='./logs',
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="longest")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

trainer.train()

Yep, LoftQ is relatively new, and the method itself is still being worked out for models beyond the ones the authors cover in their paper. It's unfortunate, but for now LoRA and QLoRA remain the best methods for LLM fine-tuning.