I have a large corpus from a specific domain.
We decided that it was better to create a tokenizer from scratch, because too many important tokens were missing from the vocabulary of the general-language tokenizer.
We use SentencePieceBPETokenizer.
However, in my first tests, odd things happened.
Let's consider the generic token “token”. It and its capitalized variations appear in the vocabulary.
However, “(token”, ““token”, “\ntoken”, “\n\ntoken”, “\n\n\ntoken”, “\n(token”, and “.\nToken” also appear in the vocabulary: that is, the token combined with a parenthesis, a quotation mark, a sequence of one or more newlines (\n), or a combination of these, and even a period followed by a newline and the token.
So the tokenizer is not separating newlines, quotation marks, parentheses, etc. from the tokens. Instead, it adds extra vocabulary entries for each variation of a token with these characters attached, so words with the same meaning are treated as different entries in the vocabulary.
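To illustrate what I think is happening, here is a minimal sketch in plain Python (it does not use the tokenizers library). My assumption is that the text is effectively split only on spaces before the BPE merges are learned, so anything glued to a word without a space stays in the same unit. The sample text and the regular expression are made up for the illustration.

```python
import re
from collections import Counter

# Made-up sample: the same word next to newlines, quotes, and parentheses.
text = 'a token\n(token) and "token".\nToken here'

# Splitting on spaces only: punctuation and newlines stay attached to the word.
space_units = Counter(text.split(" "))

# Isolating newlines and punctuation marks: each one becomes its own unit.
split_units = Counter(re.findall(r"\w+|\n|[^\w\s]", text))

print(sorted(space_units))  # includes units such as 'token\n(token)'
print(sorted(split_units))  # 'token', '(' , ')', '"', '.', '\n' are separate
```

With the space-only split, "token" never shows up as a unit on its own in this sample, which looks like the behaviour I am seeing in the trained vocabulary.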
How should I handle this problem, and how can I do it with the Tokenizer class?
For now, my code looks like this:
from tokenizers import SentencePieceBPETokenizer, normalizers
tokenizer = SentencePieceBPETokenizer()
tokenizer.normalizer = normalizers.Sequence([normalizers.NFD()])
I am aware that I can pass special tokens through the special_tokens parameter of the tokenizer's train_from_iterator method, for instance:
special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']
But how can I include more special tokens and assign specific characters (such as newline, quotation mark, parenthesis) to them?
Also, if I do that, will these tokens be treated as separate from other tokens when the tokenizer is built? Will it solve my problem?
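To make the idea of assigning characters to special tokens concrete, here is a rough sketch in plain Python of what I have in mind. The placeholder names ([NL], [QUOTE], [LPAREN], [RPAREN]) and the mark_characters helper are made up by me and are not part of the tokenizers library, and I do not know whether preprocessing the text like this is the right approach.

```python
# Hypothetical mapping from characters to placeholder tokens (names are made up).
CHAR_TO_TOKEN = {
    "\n": "[NL]",
    '"': "[QUOTE]",
    "(": "[LPAREN]",
    ")": "[RPAREN]",
}


def mark_characters(text: str) -> str:
    """Replace each mapped character with its placeholder, padded by spaces."""
    for char, token in CHAR_TO_TOKEN.items():
        text = text.replace(char, f" {token} ")
    # Collapse the runs of whitespace introduced by the padding.
    return " ".join(text.split())


# The placeholders would then be appended to the usual special tokens list.
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
special_tokens += list(CHAR_TO_TOKEN.values())

print(mark_characters('(token)\n"token"'))
print(special_tokens)
```

The corpus would be passed through mark_characters before training, and the extended special_tokens list would go to train_from_iterator. Whether the tokenizer then keeps these placeholders apart from the neighbouring words is exactly what I am unsure about.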