Smaller output vocabulary for GPT-2

I noticed that by default, GPT2LMHeadModel returns prediction scores of shape (batch_size, sequence_length, config.vocab_size) (docs link). Is there any way for me to limit the output vocabulary to only a subset of words?

I want to take the existing weights from GPT-2, but re-train a new top linear layer with a smaller vocabulary. I suppose I could mask the logits at the end, but then it feels like a waste of computational power to even predict them.
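For the masking option, one common approach is an additive mask: add 0 to the logits of allowed tokens and -inf everywhere else before the softmax, so all probability mass lands on the subset. A minimal sketch (the token ids here are hypothetical placeholders, and the logits are random stand-ins for what `GPT2LMHeadModel` would return):

```python
import torch

# Stand-in for GPT2LMHeadModel output: (batch_size, sequence_length, vocab_size)
batch, seq_len, vocab_size = 2, 4, 50257
logits = torch.randn(batch, seq_len, vocab_size)

allowed_ids = torch.tensor([464, 1135, 2061])  # hypothetical subset of token ids

# Additive mask: 0 for allowed tokens, -inf for everything else
mask = torch.full((vocab_size,), float("-inf"))
mask[allowed_ids] = 0.0
masked_logits = logits + mask

probs = masked_logits.softmax(dim=-1)
# After the softmax, all probability mass falls on the allowed ids
print(probs[..., allowed_ids].sum(dim=-1))  # all ones
```

This still computes the full 50257-way projection, so it only saves you the headache of retraining, not the compute.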

Note that this model ties the weights of the input embedding and the output layer (the LM head), so if you want to keep using the existing weights, the simplest option is to mask the logits of the tokens you don’t want in your predictions.
Otherwise you can try replacing the last layer, but you will need to adapt the modeling code to do this.
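Replacing the head could look roughly like this: build a new `nn.Linear` over the reduced vocabulary and initialize it from the rows of the tied input embedding, so it starts close to the original predictions. This is a sketch, not a tested recipe — it uses a tiny randomly initialized config so it runs without downloading weights (in practice you would start from `GPT2LMHeadModel.from_pretrained("gpt2")`), and the kept token ids are hypothetical:

```python
import torch
from torch import nn
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random GPT-2 stands in for the pretrained model
config = GPT2Config(vocab_size=100, n_embd=32, n_layer=2, n_head=2, n_positions=64)
model = GPT2LMHeadModel(config)

kept_ids = torch.tensor([5, 17, 42, 99])  # hypothetical reduced vocabulary

# New head over the small vocabulary, initialized from the rows of the
# tied input embedding that correspond to the kept tokens
new_head = nn.Linear(config.n_embd, len(kept_ids), bias=False)
with torch.no_grad():
    new_head.weight.copy_(model.transformer.wte.weight[kept_ids])
model.lm_head = new_head  # note: this breaks the weight tying

# Freeze the transformer body and train only the new head
for p in model.transformer.parameters():
    p.requires_grad = False

logits = model(torch.tensor([[1, 2, 3]])).logits
print(logits.shape)  # (1, 3, 4): scores only over the reduced vocabulary
```

Be aware that once the tying is broken, saving and reloading the model may try to re-tie or reshape the head, so you may need to adjust `config.vocab_size` and the token-id mapping consistently.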