Context
- I need to add 3 special section tokens to the LED-base-16384 vocabulary (RoBERTa/LED uses a byte-level BPE tokenizer) and then fine-tune on a custom dataset.
- Current vocab size: 50265 (token IDs 0 to 50264).
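For reference, this is how I'm checking those numbers (assuming the Hugging Face checkpoint `allenai/led-base-16384`):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384")

# Both should report 50265 if the numbers above are right.
print(len(tokenizer))                                # tokenizer vocab size
print(model.get_input_embeddings().weight.shape[0])  # embedding-matrix rows
```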
Current Understanding
- Placeholder tokens exist (checked in the snippet below):
  - madeupword0000 (50261)
  - madeupword0001 (50262)
  - madeupword0002 (50263)
- Related discussion:
  - huggingface/transformers issue #1091: "Problem with mask token id in RoBERTa vocab"
  - huggingface/transformers PR #1096 (amirsaffari): "Temporary fix for RoBERTa's mismatch of vocab size and embedding size"
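Quick check of the placeholder claim on the same checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")

# If the checkpoint carries the fairseq-style placeholders, IDs 50261-50263
# should map back to madeupword0000-0002 (and 50264 to the mask token).
print(tokenizer.convert_ids_to_tokens([50261, 50262, 50263, 50264]))
print(tokenizer.mask_token, tokenizer.mask_token_id)
```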
- Two possible approaches (rough sketches of both follow this list):
  A. Replace the placeholder tokens (IDs 50261-50263)
     - Maintains the multiple-of-8 padding (as discussed in the links above)
     - Uncertain about the impact on the model
  B. Add new tokens after the current vocabulary (IDs 50265-50267)
     - Safer, but loses the multiple-of-8 padding
     - New vocab size would be 50268
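Sketch of what I have in mind for approach A (untested; it assumes the madeupword entries really are present in the checkpoint's vocab.json, and `<sec1>`/`<sec2>`/`<sec3>` are stand-ins for my actual section tokens):

```python
import json
from pathlib import Path

from transformers import AutoTokenizer

PLACEHOLDERS = ["madeupword0000", "madeupword0001", "madeupword0002"]
NEW_TOKENS = ["<sec1>", "<sec2>", "<sec3>"]  # stand-ins for the real section tokens

# Save the slow (vocab.json + merges.txt) tokenizer locally so its files can be edited.
save_dir = Path("led-base-16384-with-sections")
tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384", use_fast=False)
tokenizer.save_pretrained(save_dir)

# Rename the placeholder entries in vocab.json, keeping their IDs, so the
# embedding matrix (and its existing rows) is left untouched.
vocab_path = save_dir / "vocab.json"
vocab = json.loads(vocab_path.read_text(encoding="utf-8"))
for old, new in zip(PLACEHOLDERS, NEW_TOKENS):
    vocab[new] = vocab.pop(old)
vocab_path.write_text(json.dumps(vocab, ensure_ascii=False), encoding="utf-8")

# Reload and register the renamed tokens as special so the tokenizer should
# treat them atomically. Because the strings already exist in the vocab,
# no new IDs are assigned and the size stays at 50265 (no embedding resize).
tokenizer = AutoTokenizer.from_pretrained(save_dir, use_fast=False)
tokenizer.add_special_tokens({"additional_special_tokens": NEW_TOKENS})
print(tokenizer.convert_tokens_to_ids(NEW_TOKENS))  # expected: [50261, 50262, 50263]
tokenizer.save_pretrained(save_dir)
```

One caveat I'm aware of: this edits the slow tokenizer's files; the fast (`tokenizers`-backed) variant reads `tokenizer.json` instead, which would need the same renaming.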
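Sketch of approach B, the usual `add_special_tokens` + `resize_token_embeddings` route (same stand-in token names):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

NEW_TOKENS = ["<sec1>", "<sec2>", "<sec3>"]  # stand-ins for the real section tokens

tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384")

num_added = tokenizer.add_special_tokens({"additional_special_tokens": NEW_TOKENS})
print(num_added, len(tokenizer))  # expected: 3 50268

# Grow the embedding matrix to match; the three new rows are freshly
# initialized and only become meaningful through fine-tuning.
model.resize_token_embeddings(len(tokenizer))

# If multiple-of-8 alignment matters, recent transformers versions can pad
# the embedding matrix without adding tokenizer entries:
# model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
```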
Questions
- Which approach is recommended?
- What are the performance implications of:
  - Replacing placeholder tokens
  - Losing multiple-of-8 alignment
- Are there any documented cases of successfully replacing placeholder tokens?
- Will either approach affect the model’s learned patterns?