Note: Newbie to LLMs
Background
I am trying to fine-tune an LLM (Llama 3) on a Stack Overflow C-language dataset.
LLM - meta-llama/Meta-Llama-3-8B
Dataset - Mxode/StackOverflow-QA-C-Language-40k
My dataset structure looks like this:
DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 40649
    })
})
Why is dataset_text_field important?
As I understand it, this field is crucial because it tells the trainer which column of the dataset to pick as the text to train the model on, so that it can answer questions.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    dataset_text_field="question",  # Specify the text field in the dataset <<<<<-----
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
)
I am assuming that if I keep "question" here, the trainer will pick the questions column and train the model to produce the answers when a user asks through a prompt. Is my assumption right? I did train my model for hours on the questions,
but when I observed the responses, most of them contained only question-like context.