Fine-tuning queries

I am new to fine-tuning, but I have read many posts, so I have several questions. Please help me.

When fine-tuning an LLM, what is usually done is:

text = input + output

Then both the input and the labels are given to the LLM as the input_ids of this concatenated text.
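To make that concrete, here is a minimal sketch of what I mean (the model name and example strings are just placeholders):

```python
from transformers import AutoTokenizer

# placeholder model and texts, just to illustrate the setup
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "### Question: What is 2+2?\n### Answer:"
completion = " 4"
text = prompt + completion  # input + output concatenated into one string

model_inputs = tokenizer(text, truncation=True)
model_inputs["labels"] = model_inputs["input_ids"].copy()  # labels start as a copy of input_ids
```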
Now my queries are:

  1. Am I correct that, in the labels, the ids corresponding to pad_token_id are set to -100 to exclude them from the loss calculation, and that there are two ways to fine-tune: compute the loss on both prompt + completion, or only on the completion? (See the first sketch after this list.)
  2. The SFTTrainer class computes the loss on both by default, but with the completion-only data collator we can compute the loss on only the completion, right?
  3. What about the Trainer class, what does it do by default: calculate the loss on just the completion or on both?
  4. Another question: what if I don't provide a data collator and just format the input with special tokens and tokenize it myself, without setting the pad token ids to -100 in the labels? What will happen in both cases, SFTTrainer and Trainer?
  5. Do we have to manually shift the labels? Also, where can I see how this label shifting is done internally? I tried to find it on GitHub but couldn't. (See the second sketch after this list.)
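For questions 1 and 2, here is a sketch of what I think is meant, assuming TRL's DataCollatorForCompletionOnlyLM and a placeholder response template:

```python
import torch
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# Question 1: set pad positions in the labels to -100 so CrossEntropyLoss ignores them
enc = tokenizer("### Question: What is 2+2?\n### Answer: 4",
                padding="max_length", max_length=32, truncation=True, return_tensors="pt")
labels = enc["input_ids"].clone()
labels[labels == tokenizer.pad_token_id] = -100

# Question 2: TRL's collator additionally masks everything before the response template,
# so the loss is computed on the completion only
collator = DataCollatorForCompletionOnlyLM(response_template="### Answer:", tokenizer=tokenizer)
```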
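For question 5, from other posts my understanding is that the shift happens inside the model's forward pass (e.g. in GPT2LMHeadModel / LlamaForCausalLM in transformers), roughly like this; is that right?

```python
import torch

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift so that tokens < n predict token n; this happens inside the model,
    # so we pass labels equal to input_ids and do not shift them ourselves.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = torch.nn.CrossEntropyLoss()  # ignore_index is -100 by default
    return loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```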

What I was doing:

  1. Created a prompt by appending the input and the output to it.
  2. Added special tokens and tokenized it.
  3. Set the labels equal to the input ids, i.e. model_inputs['labels'] = model_inputs['input_ids'],
    without shifting them and without setting the padding tokens to -100.
  4. Passed this tokenized dataset directly to the Trainer class (rough sketch below).
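Roughly, this is what my code looks like (the model name and dataset fields here are just placeholders):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

raw_dataset = Dataset.from_dict({"input": ["### Question: What is 2+2?\n### Answer:"],
                                 "output": [" 4"]})

def preprocess(example):
    text = example["input"] + example["output"] + tokenizer.eos_token
    model_inputs = tokenizer(text, truncation=True, padding="max_length", max_length=64)
    model_inputs["labels"] = model_inputs["input_ids"].copy()  # no shifting, no -100 masking
    return model_inputs

tokenized_dataset = raw_dataset.map(preprocess, remove_columns=["input", "output"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1),
    train_dataset=tokenized_dataset,
)
# trainer.train()
```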