Errors when trying to fine-tune OpenLLaMA using Trainer API

Hi all,

I’m currently trying to fine-tune the openlm-research/open_llama_3b_v2 model on my own custom dataset. My current code is:

model_name = "openlm-research/open_llama_3b_v2"
import os
import os.path
import random
from torch.utils.data import Dataset
import torch
from transformers import AutoTokenizer

# Set up tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
tokenizer.pad_token_id = tokenizer.eos_token_id
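# (As far as I can tell the OpenLLaMA tokenizer ships without a pad token, so I'm reusing
#  the EOS token as padding for the padding='max_length' calls below.)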

# Seed randomizer for reproducibility
random.seed(42)

class MyDataset(Dataset):
  def __init__(self, data: list):
    self.data = data
  
  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
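    # Each item is just whatever the tokenizer returned for one file
    # (a BatchEncoding holding input_ids and attention_mask).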
    return self.data[idx]

data_set_file_names = [fname for fname in os.listdir(".") if os.path.isfile(fname)]

# Tokenize each sample
all_content = []
for fname in data_set_file_names:
  with open(fname) as f:
    text = f.read()
    t = tokenizer(text, return_tensors="pt", padding='max_length', truncation=True)
    all_content.append(t)
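
# (With padding='max_length' and truncation=True but no explicit max_length, each file gets
#  padded/truncated to the tokenizer's model_max_length, and return_tensors="pt" wraps the
#  result in PyTorch tensors.)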

# Shuffle the list (seeded above so it's reproducible)
random.shuffle(all_content)

# Split into training (60%), evaluation (20%), and test (20%)
training_end_len = int(len(all_content) * 0.6)
eval_test_end_len = int(len(all_content) * 0.2)
train_data = MyDataset(all_content[:training_end_len])
eval_data = MyDataset(all_content[training_end_len:training_end_len + eval_test_end_len])
test_data = MyDataset(all_content[training_end_len + eval_test_end_len:training_end_len + 2 * eval_test_end_len])

# Load the pre-trained model
from transformers import AutoModelForCausalLM
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Set up training
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="trained_model")
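# (Everything is left at the TrainingArguments defaults for now; notably,
#  per_device_train_batch_size defaults to 8, which matches the leading 8 in the error below.)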
trainer = Trainer(
    model=base_model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_data,
)
trainer.train()

Upon running this, I get the following error from inside the LLaMA model (I added some print statements to modeling_llama.py to try to understand what was going on):

hi there
hidden_states.size() = torch.Size([8, 1, 2048, 3200])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py in forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, cache_position)
    634 print("hi there")
    635 print(f"{hidden_states.size() = }")
--> 636 bsz, q_len, _ = hidden_states.size()
    637
    638 query_states = self.q_proj(hidden_states)

ValueError: too many values to unpack (expected 3)

So I can see that the shape of hidden_states is not what it should be: it has 4 dimensions (8, 1, 2048, 3200) where the unpacking expects only 3. Breaking it down, it looks like…

  • 8 samples per batch (presumably corresponding to bsz, i.e. “batch size”?)
  • 1… something? Maybe this dimension needs to get squeezed out?
  • 2048 token positions (perhaps corresponding to q_len, i.e. “query length”?)
  • 3200 hidden size (presumably the dimension getting omitted via the _?)

My guess so far is that I’m making some mistake in how I’m loading and preprocessing my dataset samples, but at the moment I don’t see what could be going wrong…
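
In case it helps with the diagnosis, here is a small check I’m planning to run next to see exactly what the Trainer hands to the model (get_train_dataloader() is the Trainer’s own method; given the hidden_states shape above, I’d expect input_ids to already come out as [8, 1, 2048] here rather than the [8, 2048] the attention layer seems to want):

# Grab one batch straight from the Trainer's dataloader and inspect the tensor shapes
# that actually reach the model's forward pass.
batch = next(iter(trainer.get_train_dataloader()))
print({k: v.shape for k, v in batch.items()})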

I would very much appreciate assistance or guidance on this issue, if writing the post doesn’t cause me to end up rubber-duck debugging it first! Thank you very much for your time :smiley: