Adding special tokens to LEDTokenizer

Context

  • I need to add 3 special section tokens to the LED-base-16384 vocabulary (RoBERTa/LED uses a byte-level BPE tokenizer) and then fine-tune on a custom dataset.
  • Current vocab size: 50265, i.e. token ids 0 to 50264 (a quick check is sketched below).
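
A quick way to confirm the vocab size and see which tokens sit at the top of the vocabulary (a minimal sketch; it assumes the checkpoint in use is allenai/led-base-16384):

```python
from transformers import LEDTokenizer

# Load the tokenizer for the LED-base-16384 checkpoint
tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")

print(len(tokenizer))  # expected: 50265, i.e. token ids 0..50264

# Inspect the last few ids, just below <mask> (50264)
for idx in range(50260, 50265):
    print(idx, tokenizer.convert_ids_to_tokens(idx))
```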

Current Understanding

  1. Placeholder tokens exist: ids 50261-50263 appear to be unused placeholder entries at the end of the vocabulary.

  2. Two Possible Approaches:
    A. Replace placeholder tokens (50261-50263)

    • Keeps the vocab size unchanged, so the multiple-of-8 alignment is maintained (as mentioned in the links above)
    • Uncertain about the impact on the model of overwriting existing placeholder embeddings

    B. Append new tokens after the current vocab (ids 50265-50267)

    • Safer, but the vocab size is no longer a multiple of 8
    • New size would be 50268 (see the sketch below)
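
If approach B is chosen, the usual pattern is to register the new tokens as additional special tokens and then resize the model's embedding matrix. A minimal sketch, assuming the allenai/led-base-16384 checkpoint and hypothetical section-token names:

```python
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

# Hypothetical section markers; replace with the ones used in the dataset
section_tokens = ["<sec_intro>", "<sec_body>", "<sec_concl>"]
num_added = tokenizer.add_special_tokens({"additional_special_tokens": section_tokens})
print(num_added, len(tokenizer))  # 3, 50268

# Grow the embedding matrix from 50265 to 50268 rows; the new rows are
# randomly initialised and are learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```

On the multiple-of-8 concern: recent transformers releases accept a pad_to_multiple_of argument on resize_token_embeddings, which pads only the embedding matrix (the tokenizer vocab and existing ids stay untouched), so the alignment can be recovered without replacing any tokens.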

Questions

  1. Which approach is recommended?
  2. What are the performance implications of:
    • Replacing placeholder tokens
    • Losing multiple-of-8 alignment
  3. Are there any documented cases of successfully replacing placeholder tokens?
  4. Will either approach affect the model’s learned patterns?