Unable to train model (Loss is 0.000000)

I am trying to fine-tune an LLM (OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5) on my own data.

import torch
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the base model in 8-bit, sharded across available devices
tokenizer = AutoTokenizer.from_pretrained("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
                                             load_in_8bit=True,
                                             device_map="auto")

from datasets import load_dataset

# Load the dataset
dataset = load_dataset('parquet', data_files='data/dataset.parquet')

# Tokenize and format the dataset
def tokenize_function(examples):
    return tokenizer(examples['TEXT'], truncation=True, max_length=128, padding='max_length')


tokenized_dataset = dataset.map(tokenize_function, batched=True)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=100,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=4
)



# Causal-LM collator: labels are a copy of input_ids, with pad tokens set to -100
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

# Create the Trainer and train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    data_collator=data_collator,
)

trainer.train()

# Save the trained model
trainer.save_model("model")  # replace with the path where you want to save the model
tokenizer.save_pretrained("model")

Now the issue is that the training loss stays at 0.000000, which suggests something is wrong with my training. Also, when I load the trained model, no answers come out at all (which should not be the case). Finally, the downloaded base model is 23 GB on disk, but my saved model is only 9.6 GB.
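
My assumption (not confirmed) is that the size gap just reflects the 8-bit weights from load_in_8bit=True rather than missing parameters; a rough way to check the in-memory footprint:

# Reports the model's in-memory size in bytes (assumption: ~9.6 GB on disk
# corresponds to 8-bit weights, not dropped layers).
print(model.get_memory_footprint() / 1e9, "GB")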

My raw data is in CSV, which I have converted to parquet. The dataset has 3 columns (TEXT, source, metadata) and only 12 rows.

This is how I generated the parquet file:

import pandas as pd

df = pd.read_csv('data/data.csv')
df.to_parquet("data/dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
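
A quick check (sketch; assumes the file is at data/dataset.parquet as above) that the TEXT column survived the CSV-to-parquet round trip:

from datasets import load_dataset

ds = load_dataset('parquet', data_files='data/dataset.parquet')
print(ds['train'].column_names)       # expecting ['TEXT', 'source', 'metadata']
print(ds['train'][0]['TEXT'][:200])   # first example, truncated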

Hey Banak, I am getting the same issue when trying to fine-tune a model with QLoRA: after 200 steps my loss is 0.000000. Did you have any luck resolving this?

I am getting the same problem with Mistral.