I’m trying to train a video captioning model where I train only my image encoder and keep the text generator frozen. I am using the pre-trained `GPT2LMHeadModel` as the text generator and feed it the video embeddings directly as input (via `inputs_embeds`).
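For context, here is roughly what my forward pass looks like. It is a simplified sketch: the video features, the projection layer, and all shapes are placeholders, not my actual encoder.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no PAD, so I reuse EOS

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
for p in gpt2.parameters():                        # the text generator stays frozen
    p.requires_grad = False

# placeholder for my trainable video encoder output: (batch, frames, feat_dim)
video_feats = torch.randn(2, 8, 1024)
video_proj = nn.Linear(1024, gpt2.config.n_embd)   # trainable projection into GPT-2's hidden size
prefix_embeds = video_proj(video_feats)            # (2, 8, 768)

# caption tokens, embedded with GPT-2's own (frozen) embedding table
captions = tokenizer(["a dog runs", "a cat sleeps on a couch"],
                     return_tensors="pt", padding=True)
caption_embeds = gpt2.transformer.wte(captions["input_ids"])   # (2, seq_len, 768)

inputs_embeds = torch.cat([prefix_embeds, caption_embeds], dim=1)
logits = gpt2(inputs_embeds=inputs_embeds).logits  # (2, 8 + seq_len, vocab_size)
```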
For the tokenizer, I see that `GPT2Tokenizer` uses the same token id for BOS and EOS and has no dedicated PAD token, so I reuse that same id for padding.
- If I want my loss function to ignore the pad token, then it will also ignore the BOS and EOS tokens. Is this fine? Or do I need to assign the pad token a different id, which would then require me to finetune the text model as well since the embedding size would change? (What I mean by ignoring the pad token is sketched in the first snippet after this list.)
- If I want to use a separate tokenizer (which has a smaller vocab size), will training just the LM head (an `nn.Linear` with the vocab size as output dim) work? Or in such scenarios do I need to finetune GPT2 as well? (The head swap I have in mind is shown in the second snippet below.)
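To make the first question concrete, this is the masking I had in mind, continuing from the snippet above and computing the loss myself instead of passing `labels` to the model (the `-100` value is the default `ignore_index` of `CrossEntropyLoss`):

```python
import torch
import torch.nn as nn

pad_id = tokenizer.pad_token_id                    # same id as BOS/EOS here

labels = captions["input_ids"].clone()
labels[labels == pad_id] = -100                    # since PAD == EOS == BOS, those positions are masked too

# the video-prefix positions have no text target, so they are ignored as well
prefix_ignore = torch.full((labels.size(0), prefix_embeds.size(1)), -100, dtype=labels.dtype)
labels = torch.cat([prefix_ignore, labels], dim=1)

# shift so each position predicts the next token (what GPT-2 does internally when given `labels`)
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()
loss = nn.CrossEntropyLoss(ignore_index=-100)(
    shift_logits.view(-1, shift_logits.size(-1)),
    shift_labels.view(-1),
)
```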
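And for the second question, by "training just the LM head" I mean something like this, where the new vocab size is a placeholder for whatever the smaller tokenizer actually has:

```python
import torch.nn as nn

new_vocab_size = 8000                              # placeholder: the smaller tokenizer's vocab size
gpt2.lm_head = nn.Linear(gpt2.config.n_embd, new_vocab_size, bias=False)
# the fresh Linear is trainable by default; the rest of GPT-2 stays frozen from above
```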