`resize_token_embeddings` for performance

I read the blog post "Initializing New Word Embeddings for Pretrained Language Models" by John Hewitt and understand the idea of resizing with mean initialization for fine-tuning. However, if the resize is only for performance purposes (e.g. padding the vocabulary size to a multiple for better GPU throughput) and we only run inference with the original checkpoint untouched, is resizing with mean init enough? How does transformers guarantee that the padding positions are never sampled?
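To make the question concrete, here is a minimal numpy sketch of what I mean (this is my own toy illustration, not the transformers internals): the embedding matrix is padded with rows initialized to the mean of the existing embeddings, and at inference the logits of the dummy rows are masked so they cannot be sampled. (For reference, recent transformers versions expose `pad_to_multiple_of` and, if I recall correctly, a `mean_resizing` flag on `resize_token_embeddings`.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding matrix: vocab of 6 real tokens, hidden size 4.
vocab_size, hidden = 6, 4
emb = rng.normal(size=(vocab_size, hidden))

# Resize the vocab dimension up to 8 (as pad_to_multiple_of might)
# and initialize the new rows with the mean of the existing rows,
# in the spirit of Hewitt's mean-initialization idea.
new_vocab_size = 8
mean_vec = emb.mean(axis=0)
pad_rows = np.tile(mean_vec, (new_vocab_size - vocab_size, 1))
resized = np.vstack([emb, pad_rows])

# At inference, one way to guarantee the dummy ids are never sampled
# is to mask their logits to -inf before the softmax.
logits = resized @ rng.normal(size=hidden)  # fake next-token logits
logits[vocab_size:] = -np.inf
probs = np.exp(logits - logits[:vocab_size].max())
probs /= probs.sum()
print(probs[vocab_size:])  # padding positions get zero probability
```

My worry is whether this explicit masking is actually needed, or whether the library (or the mean-initialized rows themselves) already makes sampling a padding id effectively impossible.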
