How to train new token embedding to add to a pretrain model?


I would like to take a pretrained model and train only new token embeddings on a corpus, leaving the rest of the transformer untouched. Then, fine-tune on a task without changing the original embeddings. Finally, swap the embeddings. In short: using the Hugging Face Transformers library, how can I (a) train only the embeddings, (b) keep the embeddings frozen during training, and (c) swap the embeddings of a model?
This follows the approach taken in this article:

  1. Pre-train a monolingual BERT (i.e. a transformer) in L1 with masked language modeling
    (MLM) and next sentence prediction (NSP)
    objectives on an unlabeled L1 corpus.
  2. Transfer the model to a new language by learning new token embeddings while freezing the
    transformer body with the same training objectives (MLM and NSP) on an unlabeled L2
  3. Fine-tune the transformer for a downstream
    task using labeled data in L1, while keeping
    the L1 token embeddings frozen.
  4. Zero-shot transfer the resulting model to L2
    by swapping the L1 token embeddings with
    the L2 embeddings learned in Step 2.

Thank you!

Well, you answered your own question. You can freeze layers in PyTorch by setting `requires_grad = False` on a layer's parameters; those parameters will then not be updated during training. To swap embeddings, load the model, replace the weights of the embedding layer with the separately learned weights, and save the model again (in Transformers you can use `model.save_pretrained()`).
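The freezing part (Step 2) can be sketched like this. A tiny randomly initialized BERT stands in for the real pretrained checkpoint, and all model sizes here are made-up assumptions; in practice you would load your model with `from_pretrained`:

```python
from transformers import BertConfig, BertForMaskedLM

# Tiny random BERT as a stand-in for a real pretrained checkpoint
# (in practice: BertForMaskedLM.from_pretrained("your-l1-checkpoint")).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertForMaskedLM(config)

# Freeze every parameter of the model...
for param in model.parameters():
    param.requires_grad = False

# ...then re-enable gradients only for the token embedding matrix.
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True

# Only the word embeddings will now be updated by the optimizer.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Note that BERT ties the input embeddings to the MLM output layer by default, so the unfrozen embedding matrix is also the one used by the MLM head, which is exactly what you want for this setup.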

I am not sure how much help you need. If you need a step-by-step guide, I fear I do not have the time to write one, but the above should help you get started.
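The swap in Step 4 might look like the following. Again, tiny random models stand in for the real checkpoints (the L1 fine-tuned task model and the model whose L2 embeddings were learned in Step 2); the sizes and the output directory are assumptions:

```python
import tempfile

from transformers import (BertConfig, BertForMaskedLM,
                          BertForSequenceClassification)

# Stand-ins for the real checkpoints. The transformer bodies must share the
# same hidden size; only the vocabularies (embedding matrices) differ.
body = dict(hidden_size=32, num_hidden_layers=2,
            num_attention_heads=2, intermediate_size=64)
task_model = BertForSequenceClassification(BertConfig(vocab_size=100, **body))
l2_model = BertForMaskedLM(BertConfig(vocab_size=120, **body))

# Swap the L1 token embeddings for the L2 embeddings learned in Step 2.
l2_embeddings = l2_model.get_input_embeddings()
task_model.set_input_embeddings(l2_embeddings)
task_model.config.vocab_size = l2_embeddings.num_embeddings

# Save the resulting zero-shot L2 model with the standard Transformers API.
with tempfile.TemporaryDirectory() as out_dir:
    task_model.save_pretrained(out_dir)
```

Remember to also switch to the L2 tokenizer at inference time, since the token ids now index the L2 vocabulary.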
