Pros and Cons of Using add_tokens vs. Training a New Byte Pair Encoding (BPE) Tokenizer for Adding New Tokens to an Existing RoBERTa Model

I’m currently working on a project that involves adding new tokens (specifically, domain-specific terms) to a pre-trained biomed_RoBERTa model. I have come across two main strategies for this:

  1. Using the add_tokens method provided by the Transformers library to directly add new tokens to the tokenizer’s vocabulary (a minimal sketch of this follows the list).
  2. Training a new Byte Pair Encoding (BPE) tokenizer on the new data and merging it with the existing vocabulary.
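
For reference, here is the minimal sketch I had in mind for the first approach. The checkpoint name and the example terms are placeholders, not part of my actual setup; the key steps are adding the strings to the tokenizer and resizing the embedding matrix so the new rows exist (they start out randomly initialized and only become useful after fine-tuning):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder checkpoint; substitute the biomed RoBERTa checkpoint you are using
checkpoint = "allenai/biomed_roberta_base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical domain-specific terms
new_terms = ["angiogenesis", "immunohistochemistry"]

# add_tokens returns the number of tokens actually added
# (terms already in the vocabulary are skipped)
num_added = tokenizer.add_tokens(new_terms)

# Resize the embedding matrix so the new tokens have embedding rows;
# these rows are randomly initialized and must be learned during fine-tuning
model.resize_token_embeddings(len(tokenizer))
```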

I’m interested in hearing your thoughts on the advantages and disadvantages of each method, particularly in the context of adding domain-specific vocabulary to a pre-existing model.

My current understanding is that add_tokens is the more straightforward approach: the existing vocabulary and pre-trained embeddings stay intact, and only the embedding rows for the newly added tokens need to be learned during fine-tuning, so nothing is trained from scratch. Training a dedicated BPE tokenizer, on the other hand, has its own advantages, such as segmenting out-of-vocabulary domain terms into meaningful subwords and producing shorter, more efficient tokenizations of domain text.
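
To make the comparison concrete, this is roughly what I imagine the second approach would look like with the tokenizers library. The corpus path and vocabulary size are placeholders, and as far as I can tell the library does not merge the result with the original RoBERTa vocabulary for you, so that step would presumably mean reconciling the vocab.json and merges.txt files by hand:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the domain corpus
# ("domain_corpus.txt" and vocab_size are placeholders to tune for your data)
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(
    files=["domain_corpus.txt"],
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt, which would then need to be
# reconciled with the original tokenizer's files manually
bpe_tokenizer.save_model("domain_tokenizer")
```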

I’d appreciate your insights on these points:

  • How does each method impact the model’s generalization to new, unseen data?
  • How do these methods influence computational efficiency during training and inference?
  • How does each strategy handle out-of-vocabulary words? (I’ve included the quick check I’ve been using after this list.)
  • Are there any potential drawbacks or challenges in implementing these methods?
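
For the out-of-vocabulary question, this is the kind of quick check I have been running to see the difference; the checkpoint name and the example term are placeholders, and the exact subword split will depend on the vocabulary:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute your biomed RoBERTa checkpoint
tokenizer = AutoTokenizer.from_pretrained("allenai/biomed_roberta_base")

text = "Immunohistochemistry staining was performed."

# Before: the unseen term is broken into several BPE subword pieces
print(tokenizer.tokenize(text))

# After add_tokens: the term is matched as a single whole token
tokenizer.add_tokens(["Immunohistochemistry"])
print(tokenizer.tokenize(text))
```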

Thank you in advance for your insights!