Hi all,
I’m currently trying to fine-tune the openlm-research/open_llama_3b_v2
model on my own custom dataset. My current code is:
```python
import os
import os.path
import random

import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "openlm-research/open_llama_3b_v2"

# Set up tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Seed randomizer for reproducibility
random.seed(42)


class MyDataset(Dataset):
    def __init__(self, data: list):  # list of tokenized samples (BatchEncoding objects)
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


data_set_file_names = [fname for fname in os.listdir(".") if os.path.isfile(fname)]

# Tokenize each sample
all_content = []
for fname in data_set_file_names:
    with open(fname) as f:
        text = f.read()
    t = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True)
    all_content.append(t)

# Shuffle the list (seeded above so it's reproducible)
random.shuffle(all_content)

# Split into training (60%), evaluation (20%), and test (20%)
training_end_len = int(len(all_content) * 0.6)
eval_test_end_len = int(len(all_content) * 0.2)
train_data = MyDataset(all_content[:training_end_len])
eval_data = MyDataset(all_content[training_end_len:training_end_len + eval_test_end_len])
test_data = MyDataset(all_content[training_end_len + eval_test_end_len:training_end_len + 2 * eval_test_end_len])

# Load the pre-trained model
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Set up training
training_args = TrainingArguments(output_dir="trained_model")
trainer = Trainer(
    model=base_model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_data,
)
trainer.train()
```
Upon running this, I get the following error from inside the LLaMA model (I added a couple of print statements to `modeling_llama.py` to try to understand what was going on):
```
hi there
hidden_states.size() = torch.Size([8, 1, 2048, 3200])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py in forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, cache_position)
    634         print("hi there")
    635         print(f"{hidden_states.size() = }")
--> 636         bsz, q_len, _ = hidden_states.size()
    637
    638         query_states = self.q_proj(hidden_states)

ValueError: too many values to unpack (expected 3)
```
So I can see that the shape of `hidden_states` is not what it should be: it has 4 dimensions (shape `[8, 1, 2048, 3200]`) when the unpacking on line 636 expects 3. It looks like:

- 8: samples per batch (perhaps this should correspond to `bsz`, for "batch size"?)
- 1: … something? Maybe this part needs to get squeezed out?
- 2048: position embeddings (perhaps corresponding to `q_len`, for "query length"?)
- 3200: hidden size (presumably the part of the size getting omitted via the `_`?)
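
For reference, here's a quick check I can run on a single item of my dataset before any batching happens (just a diagnostic sketch; `all_content` is the list built in the tokenization loop above):

```python
# Diagnostic sketch: inspect one tokenized sample before the Trainer batches anything.
sample = all_content[0]
print(sample.keys())                   # which fields the tokenizer produced
print(sample["input_ids"].shape)       # does each sample already carry an extra leading dimension?
print(sample["attention_mask"].shape)  # should match input_ids
```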
My guess so far is that I’m making some mistake in how I’m loading and preprocessing my dataset samples, but at the moment I don’t see what it could be…
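
In case it helps narrow things down, here is my rough mental model of what the batching might be doing with my per-sample tensors (a sketch of how I imagine the collation step works, not the actual Trainer/collator code):

```python
import torch

# Sketch: if each dataset item holds an input_ids tensor of shape [1, seq_len]
# (which is what tokenizer(..., return_tensors="pt") gives for a single text),
# then simply stacking eight of them produces [8, 1, seq_len] rather than [8, seq_len].
per_sample = [torch.zeros(1, 2048, dtype=torch.long) for _ in range(8)]
stacked = torch.stack(per_sample)
print(stacked.shape)  # torch.Size([8, 1, 2048]) -- the same kind of extra dimension as in the error
```

If that is really what’s happening, I’m not sure whether the right fix is to squeeze that dimension out in `__getitem__`, to tokenize differently, or something else entirely.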
I would very much appreciate any assistance or guidance on this issue, if writing this post doesn’t cause me to end up rubber-duck debugging it first! Thank you very much for your time.