Importance of ignoring special tokens in loss function

I’m trying to train a video captioning model in which only my image encoder is trained and the text generator stays frozen. The text generator is a pre-trained GPT2LMHeadModel, and I feed it the video embeddings directly as input (via inputs_embeds).
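For context, here is a stripped-down sketch of what I'm doing (the VideoEncoder module, prefix length, and shapes are only illustrative, not my exact code):

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class VideoCaptioner(nn.Module):
    """Trainable video encoder feeding a frozen GPT-2 via inputs_embeds."""

    def __init__(self, video_encoder: nn.Module):
        super().__init__()
        self.video_encoder = video_encoder                   # only trainable part
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        for p in self.gpt2.parameters():                     # text generator frozen
            p.requires_grad = False

    def forward(self, frames, caption_ids, labels):
        # frames -> (batch, n_prefix, 768) embeddings in GPT-2's hidden space
        prefix = self.video_encoder(frames)
        # embed the ground-truth caption with GPT-2's own token embeddings
        cap_embeds = self.gpt2.transformer.wte(caption_ids)
        inputs_embeds = torch.cat([prefix, cap_embeds], dim=1)
        # labels must align with inputs_embeds (prefix positions set to -100)
        return self.gpt2(inputs_embeds=inputs_embeds, labels=labels)
```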

For the tokenizer, I see that GPT2Tokenizer uses the same token (`<|endoftext|>`, id 50256) for BOS and EOS and has no dedicated pad token, so the usual workaround of setting `pad_token = eos_token` makes BOS, EOS, and PAD all share the same id.
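Concretely, this is what I see when I inspect the stock tokenizer:

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
print(tok.bos_token, tok.bos_token_id)  # <|endoftext|> 50256
print(tok.eos_token, tok.eos_token_id)  # <|endoftext|> 50256
print(tok.pad_token)                    # None by default
tok.pad_token = tok.eos_token           # pad_token_id also becomes 50256
```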

  1. If I want my loss function to ignore the pad token (by masking its id in the labels, as in the first sketch below), it will also ignore the BOS and EOS tokens. Is this fine? Or do I need to assign the pad token a different id, which would then require me to fine-tune the text model as well, since the embedding size would change?
  2. If I want to use a separate tokenizer with a smaller vocab size, will training just the LM head (an nn.Linear whose output dim is the new vocab size, as in the second sketch below) work? Or in such a scenario do I need to fine-tune GPT-2 as well?
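For question 1, this is roughly how I am building the labels right now (a minimal sketch; the caption ids are arbitrary placeholders):

```python
import torch

IGNORE_INDEX = -100        # ignored by GPT-2's internal CrossEntropyLoss
pad_id = 50256             # <|endoftext|>, shared by BOS, EOS, and PAD

# toy padded batch: <BOS> ... caption tokens ... <EOS> <PAD> <PAD>
caption_ids = torch.tensor([[50256, 1000, 2000, 3000, 50256, 50256, 50256]])

labels = caption_ids.clone()
labels[labels == pad_id] = IGNORE_INDEX
# -> the padding is masked, but the real BOS/EOS are wiped out too,
#    since they share the same id
```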
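And for question 2, by "training just the LM head" I mean something like this (the new vocab size is just an example):

```python
import torch.nn as nn
from transformers import GPT2LMHeadModel

NEW_VOCAB_SIZE = 8000                          # example size for the new tokenizer

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
for p in gpt2.parameters():                    # freeze everything first
    p.requires_grad = False

# swap the tied LM head for a fresh, trainable projection to the new vocab
gpt2.lm_head = nn.Linear(gpt2.config.n_embd, NEW_VOCAB_SIZE, bias=False)
```

The new head is randomly initialized and maps to a different vocabulary, which is why I'm unsure whether keeping the rest of GPT-2 frozen is enough.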