Fine-tune and then successfully AWQ quantize

I am trying to do the above, starting with llama2-7b, but have been running into ugly-looking errors which I have not been able to resolve by searching. There are probably too many areas where I am filling in the blanks (aka guessing) to hit on the solution by chance, hence the post.

import os
import torch
from datasets import Dataset
from awq import AutoAWQForCausalLM

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    AwqConfig,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Model from Hugging Face hub
base_model = "NousResearch/Llama-2-7b-chat-hf"

dataset = Dataset.from_dict({
    "label": [ <list> ],
    "text": [ <list> ]
})

# compute_dtype = getattr(torch, "float16")

# quant_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=compute_dtype,
#     bnb_4bit_use_double_quant=False,
# )

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    # quantization_config=quant_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1


tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)


training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_params,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)

trainer.model.save_pretrained('lora')
trainer.tokenizer.save_pretrained('tokenizer')

peft_model = PeftModel.from_pretrained(
    model,
    'lora'
)
peft_model.merge_and_unload()
peft_model.save_pretrained('merged_model')


model = AutoAWQForCausalLM.from_pretrained( 'merged_model', device_map='auto' )

quant_name = 'merged-model-AWQ'
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4
}

model.quantize( tokenizer, quant_config = quant_config )
model.save_quantized( quant_name, safe_tensors=True )
tokenizer.save_pretrained( quant_name )

(disclaimer - I adapted the above slightly to make it easier to present and didn’t test the adapted version, so there may be typos).

I confess I obtained the params from examples and have not been through them exhaustively, so that could be an issue. However, right now the objective is simply to get through the process without it ending in a horrible error, so I suspect it’s something more basic that’s the problem.

Re. quantization_config: I’ve shown this commented out because I’ve tried it both with and without. I am not sure whether loading the model in 4-bit would result in a quantized output when saved (or perhaps it just affects precision?). I’m not really sure about this part, but it doesn’t seem to make any difference to the result either way.

So the adapter seems to save successfully to the lora directory. However, merged_model ends up looking like the lora adapter directory, except that it has an apparently empty adapter_model.safetensors. The program then falls over with:

OSError: merged_model does not appear to have a file named config.json

It doesn’t look like the merge worked, so I am guessing this is the problem I need to resolve.
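(For anyone checking the same thing, a minimal sketch of listing what actually got written: a properly merged model directory should contain config.json plus model weight files, whereas an adapter-only save has just adapter_config.json and adapter_model.safetensors.)

import os

# list the output directory with file sizes; a merged model has config.json and
# model*.safetensors (or pytorch_model*.bin), while an adapter-only save has
# adapter_config.json and adapter_model.safetensors
for name in sorted(os.listdir('merged_model')):
    size = os.path.getsize(os.path.join('merged_model', name))
    print(f"{name:40s} {size:>12,d} bytes")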

I did try copying in the config.json from the original model, which was suggested somewhere. Then I get as far as:

safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization

but I am guessing this is all a bit pointless anyway, since merged_model is effectively empty. I ran into a comment suggesting that merge_and_unload() does not work with safetensors and that the fine-tuned model should therefore be saved in .bin format - however, I am not sure how to get it to output this format.
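(On the .bin question: save_pretrained accepts a safe_serialization flag in recent transformers versions, so forcing .bin output would look something like the sketch below - though, as it turned out, the file format was not the real problem here.)

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
# safe_serialization=False writes pytorch_model*.bin instead of model*.safetensors
# ('bin_output' is just an example directory name)
model.save_pretrained("bin_output", safe_serialization=False)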

Lost in many areas! Any help would be appreciated.

So I found a fix in the end. This involved diving into the transformers code and adding debug lines; to be honest, I don’t see how it could have been done without doing that. Many examples seem to be incorrect, the documentation often doesn’t match the modules that are actually installed, and the integration with other modules seems fragile (e.g. in many Docker images I tried, transformers did not seem to sit comfortably with autoawq; in the end I used runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04, where this compatibility issue did not occur).

So I learnt the following: the commented-out quantization_config seems to relate to the precision the model is loaded at, rather than to the quantization of the saved output. I could not find any reference to quantization_config in the docs for AutoModelForCausalLM - there is simply no mention of this parameter (as far as I can see). However, it does seem to make a difference (e.g. printing a warning about combining a 4-bit model with the LoRA). I guess it might be useful to include quantization_config at this stage if, for example, fine-tuning a large model with limited memory. That’s not really an issue in my case, so I left it out.
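(For reference, if memory were tight, my understanding is that the 4-bit route would look roughly like the sketch below, using BitsAndBytesConfig plus peft’s prepare_model_for_kbit_training - I haven’t verified this end to end.)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load base weights in 4-bit to save memory
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map={"": 0},
)
# prepares the 4-bit model for LoRA training (e.g. casts norms, enables input grads)
model = prepare_model_for_kbit_training(model)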

The solution itself turned out to be pretty simple (although figuring it out was not!).

merge_and_unload() returns the merged model; this return value needs to be captured, and save_pretrained() called on the new model, not on the original. I.e. these lines are wrong:

peft_model.merge_and_unload()
peft_model.save_pretrained('merged_model')

and instead should be:

merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained('merged_model')

However, it then complains that do_sample is not True in generation_config; this is fixed by adding the line merged_model.generation_config.do_sample = True before the save, so it becomes:

merged_model = peft_model.merge_and_unload()
merged_model.generation_config.do_sample = True
merged_model.save_pretrained('merged_model')

Incidentally I also dropped these lines:

model.config.use_cache = False
model.config.pretraining_tp = 1

The latest version of transformers doesn’t seem to allow you to do this and just produces a warning saying the changed config will not be used.

Final code then looks as follows:

import os
import torch
from datasets import Dataset
import pprint

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    AwqConfig,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Model from Hugging Face hub
base_model = "NousResearch/Llama-2-7b-chat-hf"


dataset = Dataset.from_dict({
    "label": [ <list> ],
    "text": [ <list> ]
})

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map={"": 0}
)

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM"
)


training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_params,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)

trainer.model.save_pretrained('lora')
trainer.tokenizer.save_pretrained('tokenizer')

peft_model = PeftModel.from_pretrained(
    model,
    'lora',
)

merged_model = peft_model.merge_and_unload()
merged_model.generation_config.do_sample = True
merged_model.save_pretrained('merged_model')

and I shifted the AWQ quantization into a separate script:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained( 'merged_model', device_map='auto' )
tokenizer = AutoTokenizer.from_pretrained( 'tokenizer' )

quant_path = 'merged_model-AWQ'
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4
}


model.quantize( tokenizer, quant_config = quant_config )
model.save_quantized( quant_path )
tokenizer.save_pretrained( 'awq_tokenizer' )

which now all works. Hurray! I hope that helps someone…
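(And for completeness, this is roughly how the quantized output can be loaded back for inference - a sketch assuming autoawq’s from_quantized API and the directory names used above.)

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# load the AWQ-quantized model and the tokenizer saved earlier
model = AutoAWQForCausalLM.from_quantized('merged_model-AWQ', fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained('awq_tokenizer')

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
output = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))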

Let’s analyze this Python script step by step. The script seems to be focused on setting up and training a language model using Hugging Face’s Transformers library, and it incorporates various configurations and custom training setups.

  1. Import Statements:
    • The script imports modules from Python’s standard library, PyTorch, and Hugging Face’s transformers and datasets libraries, plus pprint for pretty printing. peft and trl are separate Hugging Face libraries that provide parameter-efficient fine-tuning (LoRA) and supervised fine-tuning utilities respectively.
  2. Model Setup:
    • A base model ("NousResearch/Llama-2-7b-chat-hf") is specified. This is a pre-trained Llama 2 chat model from Hugging Face’s model hub.
  3. Dataset Preparation:
    • The script creates a Dataset object from a dictionary with keys "label" and "text". However, the values for these keys are placeholders (<list>), which need to be replaced with actual lists of labels and texts.
  4. Model and Tokenizer Initialization:
    • The model and tokenizer are loaded from the base model. The tokenizer is configured with padding settings.
  5. PEFT (Parameter-Efficient Fine-Tuning) Parameters:
    • LoraConfig, from the peft library, defines the LoRA hyperparameters (rank, alpha, dropout, task type) used for fine-tuning.
  6. Training Arguments:
    • TrainingArguments is configured with various parameters like learning rate, batch size, etc., suitable for training the model.
  7. Trainer Setup:
    • SFTTrainer, from the trl library, is instantiated with the model, dataset, PEFT configuration, and other parameters; it performs supervised fine-tuning with the LoRA adapter applied on top of the base model.
  8. Model Saving:
    • The LoRA adapter and the tokenizer are saved locally.
  9. PEFT Model Loading and Merging:
    • The script reloads the saved adapter as a PeftModel on top of the base model and calls merge_and_unload, which folds the LoRA weights back into the base weights and returns a plain transformers model.
  10. Setting Generation Config and Saving Merged Model:
    • The script sets the do_sample attribute on the generation config and saves the merged model.

Potential Issues and Errors:

  • Placeholder values in the dataset: actual data needs to be provided in place of <list> (see the sketch below).
  • Library versions: LoraConfig, PeftModel, and SFTTrainer come from the peft and trl libraries rather than from transformers itself, so both packages need to be installed and kept compatible with the installed transformers version.
  • The script lacks error handling, which is crucial in a complex setup like this.
  • The script does not include any code for the actual training (such as calling trainer.train()); it only sets up the trainer before saving the adapter.

This analysis is based on the information available in the script and general knowledge of Python and the Hugging Face libraries involved; the exact behaviour will also depend on the specific versions of those libraries that are installed.
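To make the first and last bullets concrete, a minimal sketch - with purely illustrative strings standing in for <list>, assuming the Llama-2 chat prompt format:

from datasets import Dataset

# purely illustrative stand-ins for the <list> placeholders in the script
dataset = Dataset.from_dict({
    "label": [0, 1],
    "text": [
        "<s>[INST] What is AWQ? [/INST] AWQ is a 4-bit weight-only quantization method. </s>",
        "<s>[INST] What is LoRA? [/INST] LoRA adds small trainable low-rank adapter matrices. </s>",
    ],
})
print(dataset)

With real data in place, the remaining gap is a trainer.train() call between constructing the SFTTrainer and saving the adapter.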
