Adding special tokens to LEDTokenizer

Context

  • I need to add 3 special section tokens to the LED-base-16384 vocabulary (RoBERTa/LED uses a byte-level BPE tokenizer) and then fine-tune on a custom dataset.
  • Current vocab size: 50265, i.e. token ids 0 to 50264 (a quick check is sketched below).
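
A quick way to confirm the vocab size and see which tokens sit at the top of the vocabulary (a minimal sketch; it assumes the checkpoint in use is allenai/led-base-16384):

```python
from transformers import LEDTokenizer

# Load the tokenizer for the LED-base-16384 checkpoint
tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")

print(len(tokenizer))  # expected: 50265, i.e. token ids 0..50264

# Inspect the last few ids, just below <mask> (50264)
for idx in range(50260, 50265):
    print(idx, tokenizer.convert_ids_to_tokens(idx))
```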

Current Understanding

  1. Placeholder tokens exist: ids 50261-50263 appear to be unused placeholder entries at the end of the vocabulary.

  2. Two Possible Approaches:
    A. Replace placeholder tokens (50261-50263)

    • Keeps the vocab size unchanged, so the multiple-of-8 alignment is maintained (as mentioned in the links above)
    • Uncertain about the impact on the model of overwriting existing placeholder embeddings

    B. Append new tokens after the current vocab (ids 50265-50267)

    • Safer, but the vocab size is no longer a multiple of 8
    • New size would be 50268 (see the sketch below)
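
If approach B is chosen, the usual pattern is to register the new tokens as additional special tokens and then resize the model's embedding matrix. A minimal sketch, assuming the allenai/led-base-16384 checkpoint and hypothetical section-token names:

```python
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

# Hypothetical section markers; replace with the ones used in the dataset
section_tokens = ["<sec_intro>", "<sec_body>", "<sec_concl>"]
num_added = tokenizer.add_special_tokens({"additional_special_tokens": section_tokens})
print(num_added, len(tokenizer))  # 3, 50268

# Grow the embedding matrix from 50265 to 50268 rows; the new rows are
# randomly initialised and are learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```

On the multiple-of-8 concern: recent transformers releases accept a pad_to_multiple_of argument on resize_token_embeddings, which pads only the embedding matrix (the tokenizer vocab and existing ids stay untouched), so the alignment can be recovered without replacing any tokens.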

Questions

  1. Which approach is recommended?
  2. What are the performance implications of:
    • Replacing placeholder tokens
    • Losing multiple-of-8 alignment
  3. Are there any documented cases of successfully replacing placeholder tokens?
  4. Will either approach affect the model’s learned patterns?