How to choose dataset_text_field in Hugging Face's SFTTrainer for my LLM

Note: Newbie to LLMs

Background

I am trying to fine-tune an LLM (Llama 3) on a Stack Overflow C-language dataset.

LLM - meta-llama/Meta-Llama-3-8B
Dataset - Mxode/StackOverflow-QA-C-Language-40k

My dataset structure looks like so

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 40649
    })
})

Why is dataset_text_field important?

This field is crucial because it tells the trainer which column of the dataset to use as the training text for the model.

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    dataset_text_field="question",  # Specify the text field in the dataset <<<<<-----
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
)

I am assuming that if I set it to "question", the trainer will pick the question column and the model will learn to answer those questions when a user asks them through a prompt?

Is my assumption right? I trained my model for hours on the question column, but when I inspected the responses, most of them just looked like question-style text.

You need to provide a full prompt to the LLM, not just the questions. For each row, you will need a prompt containing both your question and its answer, e.g. "Question: Which is the largest country?\nAnswer: Russia". This is just a demo; you can design your own prompt template and use the map function over your dataset to create a prompt column, then point dataset_text_field at that column. You can refer to the notebook on Unsloth's GitHub repo where they fine-tune Llama 3.1 8B.
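A minimal sketch of the map step described above. The helper name `format_prompt` and the "Question:/Answer:" template are illustrative, not part of TRL; the only requirement is that each row ends up with a single text column that SFTTrainer can read via dataset_text_field.

```python
# Hypothetical prompt-formatting helper: merges a row's "question" and
# "answer" columns into one supervised-fine-tuning string.
def format_prompt(example):
    # One complete training example containing both question and answer.
    text = f"Question: {example['question']}\nAnswer: {example['answer']}"
    return {"text": text}

# Applied over the whole dataset it adds a "text" column, e.g.:
#   dataset = dataset.map(format_prompt)
#   trainer = SFTTrainer(..., dataset_text_field="text", ...)

row = {"question": "Which is the largest country?", "answer": "Russia"}
print(format_prompt(row)["text"])
```

With this in place, the model is trained on question-answer pairs rather than questions alone, so at inference time it learns to continue a "Question: ..." prompt with an answer instead of more question text.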