I saw these lines:
bnb_4bit_compute_dtype=torch.float16,
...
optim = "paged_adamw_32bit"
...
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)
in the Falcon tutorial, and they are confusing me. In the original QLoRA paper, the model is kept in 4-bit, and during training the computation is done in brain float 16 (bf16), not regular float16 and not float32. See Equation 5:
$$
\mathbf{Y}^{\text{BF16}} = \mathbf{X}^{\text{BF16}}\,\text{doubleDequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, \mathbf{W}^{\text{NF4}}) + \mathbf{X}^{\text{BF16}}\mathbf{L}_1^{\text{BF16}}\mathbf{L}_2^{\text{BF16}}
$$

$$
\text{doubleDequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, \mathbf{W}^{\text{k-bit}}) = \text{dequant}(\text{dequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}), \mathbf{W}^{\text{4bit}}) = \mathbf{W}^{\text{BF16}}
$$
which says the computation is done in brain float 16 (bf16).
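To make sure I'm reading the equation right, here is my own conceptual sketch of what doubleDequant does (the helper names are hypothetical, for illustration only, not the actual bitsandbytes kernels):

import torch

# Conceptual illustration only: the quantization constants c2 are themselves
# quantized, so dequantization happens twice before the matmul runs in bf16.
def dequant(constants, values):
    # rescale quantized values by their quantization constants
    return values.to(torch.float32) * constants

def double_dequant(c1_fp32, c2_8bit, w_4bit):
    c2_fp32 = dequant(c1_fp32, c2_8bit)  # first dequant: recover the constants
    w_fp32 = dequant(c2_fp32, w_4bit)    # second dequant: recover the weights
    return w_fp32.to(torch.bfloat16)     # computation then proceeds in BF16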
I'm puzzled:
- Why are we using an optimizer in 32-bit floating point (optim = "paged_adamw_32bit") when the paper says computation is done in bf16? (The bnb code also sets bnb_4bit_compute_dtype=torch.float16, which is at least 16-bit, but regular float rather than brain float; that is likely a bug.)
- Why are the norm layers upcast to float32 when the paper, and supposedly the bnb code, use bf16 (again, the code likely has a bug: it uses fp16, not bf16)?
- The model is loaded from a bf16 checkpoint, so why upcast the norm layers to float32? Shouldn't there be a datatype mismatch error anyway?
In short, I don't understand what is going on, or why the code doesn't simply crash at runtime.
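For what it's worth, this is the quick check I would run after the upcasting loop to see which dtype each parameter actually ends up in (my own diagnostic snippet, not from the tutorial):

from collections import Counter

# Count parameters by dtype: the packed 4-bit base weights, the upcast norm
# layers and the LoRA adapter weights should show up as separate entries.
print(Counter(p.dtype for p in trainer.model.parameters()))

for name, param in trainer.model.named_parameters():
    if "norm" in name:
        print(name, param.dtype)  # expected: torch.float32 after the loop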
Ref: Google Colab notebook: https://colab.research.google.com/drive/1BiQiw31DT7-cDp1-0ySXvvhzqomTdI-o
All Code
# -*- coding: utf-8 -*-
"""Falcon-Guanaco.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1BiQiw31DT7-cDp1-0ySXvvhzqomTdI-o
## Finetune Falcon-7b on a Google colab
Welcome to this Google Colab notebook that shows how to fine-tune the recent Falcon-7b model on a single Google Colab instance and turn it into a chatbot.
We will leverage the PEFT library from the Hugging Face ecosystem, as well as QLoRA, for more memory-efficient finetuning.
## Setup
Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and TRL to leverage the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model into 4bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We will also install `einops` as it is a requirement to load Falcon models.
"""
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb
"""## Dataset
For our experiment, we will use the Guanaco dataset, which is a clean subset of the OpenAssistant dataset adapted to train general purpose chatbots.
The dataset can be found [here](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
"""
from datasets import load_dataset
dataset_name = "timdettmers/openassistant-guanaco"
dataset = load_dataset(dataset_name, split="train")
"""## Loading the model
In this section we will load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b), quantize it in 4bit and attach LoRA adapters on it. Let's get started!
"""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "ybelkada/falcon-7b-sharded-bf16"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
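# (Note added for this question: Eq. 5 of the QLoRA paper computes in BF16;
# bitsandbytes also accepts bnb_4bit_compute_dtype=torch.bfloat16, which
# would match the paper on GPUs that support bfloat16.)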
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False
"""Let's also load the tokenizer below"""
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
"""Below we will load the configuration file in order to create the LoRA model. According to QLoRA paper, it is important to consider all linear layers in the transformer block for maximum performance. Therefore we will add `dense`, `dense_h_to_4_h` and `dense_4h_to_h` layers in the target modules in addition to the mixed query key value layer."""
from peft import LoraConfig
lora_alpha = 16
lora_dropout = 0.1
lora_r = 64
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
"""## Loading the trainer
Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), which provides a wrapper around the transformers `Trainer` to easily fine-tune models on instruction-based datasets using PEFT adapters. Let's first load the training arguments below.
"""
from transformers import TrainingArguments
output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)
"""Then finally pass everthing to the trainer"""
from trl import SFTTrainer
max_seq_length = 512
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)
"""We will also pre-process the model by upcasting the layer norms in float 32 for more stable training"""
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)
"""## Train the model
Now let's train the model! Simply call `trainer.train()`
"""
trainer.train()
"""During training, the model should converge nicely as follows:

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.
"""