Gemma-2 & Phi-3 SFT nuances

There are a few minor nuances that are important during fine-tuning/inference of LLMs. I am experimenting with Gemma-2 and Phi-3, and here are the little things that I don’t quite get:

  1. Some LLMs do not have a pad token. Does it matter whether I set it to the unk token or the eos token?

    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.add_eos_token = True
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
    

    or just

    tokenizer.pad_token = tokenizer.eos_token
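
    or, for the unk option mentioned above (a sketch, assuming the tokenizer actually defines an unk token, as the Gemma-2 and Phi-3 tokenizers do):

    tokenizer.pad_token = tokenizer.unk_token
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)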
    
  2. Does the padding side matter? Here it states that "Since LLMs are not trained to continue from pad tokens, your input needs to be left-padded", which I do not get, because if we pad on the left we will have pad tokens and then the text, so the model still has to continue from pads. Also, during fine-tuning with flash_attention_2 it throws a warning to set padding_side='right'.
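
    A minimal sketch of what I mean (the checkpoint name is just a placeholder, and I am assuming the stock Hugging Face tokenizer API):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")  # placeholder checkpoint
    tokenizer.pad_token = tokenizer.eos_token

    # Fine-tuning (e.g. with flash_attention_2): pad on the right,
    # so every sequence starts with real tokens at position 0
    tokenizer.padding_side = "right"
    train_batch = tokenizer(["short prompt", "a somewhat longer prompt"], padding=True, return_tensors="pt")

    # Generation: pad on the left, so the last token of each row is a real token
    # and the model continues from the prompt rather than from a run of pads
    tokenizer.padding_side = "left"
    gen_batch = tokenizer(["short prompt", "a somewhat longer prompt"], padding=True, return_tensors="pt")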

  3. Does it make sense to use in-context learning after SFT?
    I will update the list.