I am new to fine-tuning, but I have read many posts, so I have several queries. Please help me.
When fine-tuning an LLM, what is mostly done is:
text = input + output
This text is tokenized, and the resulting input_ids serve as both the inputs and the labels given to the LLM.
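For reference, here is a minimal sketch of the preprocessing I mean; the model name, column names, and max_length are just placeholders I picked for illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model name
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default

def preprocess(example):
    # text = input + output, as described above
    text = example["input"] + example["output"]
    enc = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    # labels start as a copy of input_ids, then padding positions are set to -100
    # so that CrossEntropyLoss ignores them
    enc["labels"] = [
        tok if tok != tokenizer.pad_token_id else -100 for tok in enc["input_ids"]
    ]
    return enc
```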
Now my queries are:
- Am I correct that in the labels, the ids corresponding to pad_token_id are set to -100 to exclude them from the loss calculation, and that there are two ways to fine-tune: compute the loss on both prompt + completion, or only on the completion?
- The SFTTrainer class by default computes the loss on both, but by using the completion-only data collator we can compute the loss on just the completion, right? (See the collator sketch after this list.)
- What about the Trainer class, what does it do by default? Does it calculate the loss on just the completion or on both?
- Another question: what happens if I don't provide a data collator, and instead format the inputs with special tokens myself and tokenize them, without setting the pad token ids to -100 in the labels? What will happen in both cases, with SFTTrainer and with Trainer?
- Do we have to manually shift the labels? Also, where can I see how this shifting of the labels is done internally? I tried to find it on GitHub but couldn't. (See the loss sketch after this list.)
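To make the completion-only question concrete, this is the kind of setup I am thinking of with trl's DataCollatorForCompletionOnlyLM; the model name, toy data, prompt format, and response template are made up for illustration, and the exact SFTTrainer arguments may differ depending on the trl version:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# toy dataset in a prompt format I made up for illustration
train_dataset = Dataset.from_dict(
    {"text": ["### Question: What is 2 + 2?\n### Answer: 4"]}
)

# The collator sets the label ids of every token up to and including the
# response template to -100, so the loss is computed only on the completion.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Answer:",  # must match the prompt format above
    tokenizer=tokenizer,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,  # depending on the trl version, you may need
                                  # dataset_text_field="text" (or an SFTConfig)
    data_collator=collator,
)
```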
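And for the shifting question, my current understanding (from skimming the causal LM forward code in transformers) is that the model itself does roughly the following when labels are passed, so the labels we provide are unshifted copies of the input ids:

```python
import torch
from torch import nn

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift so that position i predicts token i+1: logits drop the last position,
    # labels drop the first one. Positions whose label is -100 are ignored,
    # because CrossEntropyLoss has ignore_index=-100 by default.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = nn.CrossEntropyLoss()
    return loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```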
What I am doing is:
- Created a prompt and appended the input and output to it.
- Added special tokens and tokenized it.
- Set the labels equal to the input ids, i.e. `model_inputs["labels"] = model_inputs["input_ids"]`, without shifting them and without setting the padding tokens to -100.
- Passed this tokenized dataset directly to the Trainer class.
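Put together, my current code looks roughly like this (the model name, toy data, and hyperparameters are just placeholders):

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# toy data; my real dataset has an input column and an output column
raw = Dataset.from_dict({"input": ["Translate: hello ->"], "output": [" bonjour"]})

def tokenize(example):
    # prompt with the input and the output appended, plus an end-of-text token
    text = example["input"] + example["output"] + tokenizer.eos_token
    model_inputs = tokenizer(
        text, truncation=True, max_length=512, padding="max_length"
    )
    # labels are just a copy of input_ids: no shifting, no -100 on the padding
    model_inputs["labels"] = model_inputs["input_ids"].copy()
    return model_inputs

tokenized = raw.map(tokenize, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1),
    train_dataset=tokenized,
)
trainer.train()
```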