Special tokens appear to be treated as atomic. However, the implementation of special tokens is fairly complex (it has been revised repeatedly over a long period of time), so it is safer to check the current behavior of the library while working with them.
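As a rough illustration of that atomicity, here is a minimal sketch assuming the Hugging Face `tokenizers` library (the `<sep>` token, the whitespace pre-tokenizer, and the tiny corpus are made up for the example): a token passed to the trainer as a special token is also registered on the tokenizer as an added special token, and at encode time it is matched before the BPE model runs, so it should not be split.

```
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy tokenizer trained on a made-up corpus, purely for illustration.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["<unk>", "<sep>"], vocab_size=100)
tokenizer.train_from_iterator(["ABCD EFGH", "ABAB EFEF"], trainer=trainer)

# '<sep>' is matched as an added special token before the BPE model,
# so it should come back as a single token rather than '<', 'sep', '>'.
print(tokenizer.encode("ABCD<sep>EFGH").tokens)
```

Since the behavior has changed across versions, a small check like this is an easy way to confirm what the installed version actually does.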
I’m building a BPE tokenizer from scratch, and I’d like to add some tokens to its vocabulary that are never broken apart and never included in merges.
The second part of that (no tokens in the merges) is easy: the tokens don’t appear in the training text I give the tokenizer. But for the first part (never breaking my special token, e.g. `[Y]`, into its `[`, `Y`, and `]` components), I’m less sure of the correct process.
I had thought I could just add the tokens to the tokenizer using add_tokens() befo…
(GitHub issue, opened 21 Apr 2022, closed 27 Apr 2022)
Hi,
I want to train a tokenizer with code like the following
```
from tokenizers.trainers import BpeTrainer

# tokenizer, vocab_size and iterator_over_seqs come from setup not shown here.
# I am not sure about the correct way, so I try to add '<sep>' in every possible way.
trainer = BpeTrainer(special_tokens=["<unk>", "<pad>", "<sep>"], vocab_size=vocab_size)
tokenizer.add_special_tokens(["<sep>"])
tokenizer.add_tokens(["<sep>"])
tokenizer.train_from_iterator(iterator_over_seqs, trainer=trainer)
```
An example sequence is `ABCD<sep>EFGH`.
However, the trained vocabulary contains the tokens `'<'`, `'>'`, `'e'`, `'ep'`, `'p'`, `'s'`, and `'sep'`, which are undesired.
So I'm wondering: what should I do to make the tokenizer treat `'<sep>'` as a single special token?
Thanks
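For the quoted issue, whatever the reason for the unwanted entries in the vocabulary, a quick sanity check is to ask the trained tokenizer directly whether `<sep>` got its own entry and whether it survives encoding in one piece. This is only a sketch; it assumes the `tokenizer` object from the snippet above after training has finished.

```
# Two quick checks on the trained tokenizer from the snippet above:
# 1) '<sep>' should have its own entry in the vocabulary (added tokens included),
# 2) it should be encoded as a single token, not as '<', 'sep', '>'.
print("<sep>" in tokenizer.get_vocab())
print(tokenizer.encode("ABCD<sep>EFGH").tokens)
```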