Note: Newbie to LLMs
Background
I am trying to fine-tune an LLM (Llama 3) on a Stack Overflow C-language dataset.
LLM - meta-llama/Meta-Llama-3-8B
Dataset - Mxode/StackOverflow-QA-C-Language-40k
My dataset structure looks like this:
DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 40649
    })
})
Why is dataset_text_field important?
As I understand it, this field is crucial because it tells the trainer which column of the dataset to pick as the text to train the model on, so that it can answer questions.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    dataset_text_field="question",  # Specify the text field in the dataset <<<<<-----
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
)
I am assuming that if I keep "question" here, the trainer will pick the questions column and train the model to produce the answers when a user asks through a prompt. Is my assumption right? I did train my model for hours on the questions,
but when I observed the responses, most of them contained only question-like context.