I am a bit confused about the way SFTTrainer is used for fine-tuning an LLM. Take Llama-2 as an example.
Approach-1 Link
model_id = "NousResearch/Llama-2-7b-hf"
dataset = load_dataset("mlabonne/mini-platypus", split="train")
The dataset has two fields, ['instruction', 'output']. The instruction field holds formatted text of the form:
Instruction:
Text
Response:
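For reference, this is how I checked the fields (just a quick inspection sketch; it reloads the same dataset as above, and the comments describe what I understand each column to hold, not verbatim output):

from datasets import load_dataset

# Look at what Approach-1's dataset actually contains
dataset = load_dataset("mlabonne/mini-platypus", split="train")

print(dataset.column_names)              # ['instruction', 'output']
print(dataset[0]["instruction"][:300])   # formatted prompt ending with the Response: header
print(dataset[0]["output"][:300])        # plain textual answer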
The output field is just the required textual response. When the SFTTrainer is instantiated, only the instruction field is referenced:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=dataset,                 # same dataset object reused for evaluation
    peft_config=peft_config,
    dataset_text_field="instruction",     # only the instruction column is named here
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_arguments,
)
There is no train/eval split in the dataset, yet the author passes the same dataset to both the train_dataset and eval_dataset arguments.
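If a separate eval set were wanted, I would have expected an explicit split before building the trainer, something like this (just a sketch using the standard datasets API; the 10% test size and the seed are arbitrary):

from datasets import load_dataset

dataset = load_dataset("mlabonne/mini-platypus", split="train")
split = dataset.train_test_split(test_size=0.1, seed=42)   # hold out 10% for evaluation

# then: train_dataset=split["train"], eval_dataset=split["test"]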
Approach-2 Link
model_id = "NousResearch/llama-2-7b-chat-hf"
dataset_name = "mlabonne/guanaco-llama2-1k"
In this case the dataset has only one field, text, which already contains the fully formatted prompt and response.
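A quick inspection sketch of that single field (again reloading the dataset; the slicing is only to keep the print short):

from datasets import load_dataset

dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

print(dataset.column_names)      # ['text']
print(dataset[0]["text"][:300])  # prompt and response packed together in one string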
This time, when the SFTTrainer is instantiated, only the text field is referenced:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",        # the whole sample lives in the text column
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
So the question is: in the first case, how does the trainer know to look for the "output" field, and in the second case, how does it know that the entire sample is in the "text" field?