Shape-mismatch RuntimeError when running QLoRA fine-tuning code from a recently posted DeepLearning.AI video that uses HF libraries (among others)

I’ve been working through the DeepLearning.AI video lecture “Building with Instruction-Tuned LLMs: A Step-by-Step Guide” (on YouTube) today. It uses a number of HF and HF-adjacent libraries, and I hit an error when I tried to run the first notebook covered in the lecture. Frustrated, I downloaded the notebook to my local machine as a Python file (the machine has a GPU and CUDA, and I’ve trained models with accelerate on it before), and it still errors out with a RuntimeError about matrix dimensions not lining up during a matmul, apparently inside the LoRA-wrapped q_proj.

I don’t know what the preferred platform for reporting issues like this is, or what the policy on third-party links is, so here is a copy-paste of the error and the code that produced it. Most of these libraries are past GPT-4’s training cutoff and DeepLearning.AI doesn’t have a Discord, so this is the only place I can think of to ask. Any assistance is appreciated.

Is this a problem with the code or with the libraries, and how would one use these libraries together properly so that it doesn’t blow up? If it’s relevant: the second operand of the failing matmul is 1x8388608, and 8388608 = 4096 × 4096 / 2, which makes me suspect the q_proj weight is still in its packed 4-bit form when F.linear is called on it, though I don’t know the internals well enough to say.
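In case a version mismatch turns out to be the culprit, here is a small snippet (not part of the notebook) that I can run to report the exact library versions in this environment; happy to post its output if that would help:

import torch
import datasets, transformers, peft, trl, accelerate, bitsandbytes

# Print each library's version, plus the CUDA toolkit version torch was built against.
for lib in (datasets, transformers, peft, trl, accelerate, bitsandbytes, torch):
    print(lib.__name__, lib.__version__)
print("CUDA:", torch.version.cuda, "available:", torch.cuda.is_available())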

Error:

Traceback (most recent call last):
  File "openllama-tuning-fix-attempt.py", line 156, in <module>
    supervised_finetuning_trainer.train()
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
    layer_outputs = decoder_layer(
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
    query_states = self.q_proj(hidden_states)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nameomitted/miniconda3/envs/mlp/lib/python3.8/site-packages/peft/tuners/lora.py", line 565, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (331x4096 and 1x8388608)

Code:

# Converted directly from the notebook; the only code changes are that the push-to-the-Hub section is commented out and the model checkpoint is swapped for one that actually exists
"""Let's look at our dataset to get an idea of what we're working with!"""

from datasets import load_dataset

dbricks_15k_dataset_base = load_dataset("databricks/databricks-dolly-15k")

"""Let's check out some brief stats about our dataset:"""

dbricks_15k_dataset_base

import matplotlib.pyplot as plt
from datasets import load_dataset

def plot_sequence_lengths(dataset_obj):

    # Initialize a list to store the sequence lengths
    sequence_lengths = []

    # list of indices that are too long
    too_long = []

    # Loop over the dataset and get the lengths of text sequences
    for idx, example in enumerate(dataset_obj["train"]):
        sequence_lengths.append(len(example['instruction']) + len(example["context"]) + len(example["response"]))
        if sequence_lengths[idx] > 2200:
          too_long.append(idx)

    # Plot the histogram
    plt.hist(sequence_lengths, bins=30)
    plt.xlabel('Sequence Length')
    plt.ylabel('Count')
    plt.title('Distribution of Text Sequence Lengths')
    plt.show()

    return too_long

indexes_to_drop = plot_sequence_lengths(dbricks_15k_dataset_base)

len(indexes_to_drop)

dbricks_15k_dataset_reduced = dbricks_15k_dataset_base["train"].select(
    i for i in range(len(dbricks_15k_dataset_base["train"])) if i not in set(indexes_to_drop)
)

dbricks_15k_dataset_reduced

dbricks_15k_dataset_prepared = dbricks_15k_dataset_reduced.train_test_split(test_size=0.1)

indexes_to_drop = plot_sequence_lengths(dbricks_15k_dataset_prepared)

dbricks_15k_dataset_prepared

"""Before we can begin training, we need to set up a few helper functions to ensure our dataset is parsed in the correct format and we save our PEFT adapters!"""

def formatting_func(example):
  if example.get("context", "") != "":
      input_prompt = (f"Below is an instruction that describes a task, paired with an input that provides further context. "
      "Write a response that appropriately completes the request.\n\n"
      "### Instruction:\n"
      f"{example['instruction']}\n\n"
      f"### Input: \n"
      f"{example['context']}\n\n"
      f"### Response: \n"
      f"{example['response']}")

  else:
    input_prompt = (f"Below is an instruction that describes a task. "
      "Write a response that appropriately completes the request.\n\n"
      "### Instruction:\n"
      f"{example['instruction']}\n\n"
      f"### Response:\n"
      f"{example['response']}")

  return {"text" : input_prompt}

formatted_dataset = dbricks_15k_dataset_prepared.map(formatting_func)

formatted_dataset

formatted_dataset["train"][2]["text"]

"""Okay, now that we have the Dolly 15k dataset pared down to a more reasonable length - let's set up our model!

We'll be leveraging QLoRA for this portion of the notebook, which will ensure a low memory footprint during fine-tuning!

- [Paper](https://arxiv.org/pdf/2305.14314.pdf)
- [Blog](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
"""

import torch
import transformers
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

model_id = "openlm-research/open_llama_7b"

qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
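# Note: target_modules isn't set above, so (as far as I understand) peft falls back to its
# built-in defaults for LLaMA-style models (q_proj and v_proj), which matches the LoRA-wrapped
# q_proj at the bottom of the traceback.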

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
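# QLoRA settings left exactly as in the video: 4-bit NF4 with double quantization and bfloat16
# compute. (The TrainingArguments further down set fp16=True; I also left that as-is, but am
# flagging it in case the mixed dtypes matter.)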

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
)

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
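# This only adds [PAD] to the tokenizer; the notebook never resizes the model's token embeddings,
# and I kept that as-is since it's what the video shows.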

print(base_model)

"""Now, let's set up our SupervisedFineTuningTrainer and let it rip!

More information on the SFTTrainer is available here:

- [HF Documentation](https://huggingface.co/docs/trl/main/en/sft_trainer)
- [Repository](https://github.com/lvwerra/trl/blob/main/trl/trainer/sft_trainer.py)


"""

from trl import SFTTrainer

supervised_finetuning_trainer = SFTTrainer(
    base_model,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_steps=5000,
        output_dir="./SFTOpenLM-Dolly15k",
        optim="paged_adamw_8bit",
        fp16=True,
    ),
    tokenizer=tokenizer,
    peft_config=qlora_config,
    dataset_text_field="text",
    max_seq_length=512
)
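# As I understand the TRL docs linked above, SFTTrainer handles tokenizing the "text" field and
# truncating to max_seq_length itself, so the dataset isn't tokenized manually anywhere above.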

supervised_finetuning_trainer.train()


# Everything below here is commented out because I wanted to get the model training before adding in the code that pushes it to the Hub

# from huggingface_hub import notebook_login

# notebook_login()

# base_model.push_to_hub("FourthBrainGenAI/FB-DLAI-Instruct-tune-v3", private=True)

# tokenizer.push_to_hub("FourthBrainGenAI/FB-DLAI-Instruct-tune-v3")

# from peft import get_peft_model
# import torch
# import transformers
# from peft import LoraConfig
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
# from transformers import AutoTokenizer

# lora_config = LoraConfig.from_pretrained("FourthBrainGenAI/FB-DLAI-Instruct-tune-v3")
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

# tokenizer = AutoTokenizer.from_pretrained("FourthBrainGenAI/FB-DLAI-Instruct-tune-v3")
# model = AutoModelForCausalLM.from_pretrained(
#     lora_config.base_model_name_or_path,
#     quantization_config=bnb_config,
#     device_map={"":0})

# model = get_peft_model(model, lora_config)

# from IPython.display import display, Markdown

# def make_inference(instruction, context = None):
#   if context:
#     prompt = f"Below is an instruction that describes a task, paired with an input that provides further context.\n\n### Instruction: \n{instruction}\n\n### Input: \n{context}\n\n### Response: \n"
#   else:
#     prompt = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction: \n{instruction}\n\n### Response: \n"
#   inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to("cuda:0")
#   outputs = base_model.generate(**inputs, max_new_tokens=100)
#   display(Markdown((tokenizer.decode(outputs[0], skip_special_tokens=True))))
#   outputs = model.generate(**inputs, max_new_tokens=50)
#   print("---- NON-INSTRUCT-TUNED-MODEL ----")
#   display(Markdown((tokenizer.decode(outputs[0], skip_special_tokens=True))))

# make_inference("Convert the text into a dialogue between two characters.", "Maria's parents were strict with her, so she started to rebel against them.")

# make_inference("Explain in simple terms how the attention mechanism of a transformer model works")

# make_inference("Identify the odd one out and explain your choice.", "Orange, Green, Airplane.")

Cross-posted from the Discord, since I thought this might be a bit long for that more informal venue.
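P.S. Here is a stripped-down forward pass I put together (not from the notebook, and untested so far) to check whether the shape error already shows up in the PEFT + 4-bit q_proj path on its own, without SFTTrainer in the loop:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer
from peft import LoraConfig, get_peft_model

model_id = "openlm-research/open_llama_7b"

# Same 4-bit config as in the notebook above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
)

# Wrap the quantized model with the same LoRA config that gets passed to SFTTrainer above.
peft_model = get_peft_model(
    base_model,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"),
)

tokenizer = LlamaTokenizer.from_pretrained(model_id)
inputs = tokenizer("Below is an instruction that describes a task.", return_tensors="pt").to("cuda:0")

# Single forward pass through the LoRA-wrapped 4-bit model.
with torch.no_grad():
    outputs = peft_model(**inputs)
print(outputs.logits.shape)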