I'm not sure whether it's a bug or a feature, but modifying the normalizer of a pretrained tokenizer sometimes works and sometimes doesn't.
For example, it works for "mistralai/Mistral-7B-v0.1" but not for "mistralai/Mistral-7B-v0.3":
from transformers import AutoTokenizer
from tokenizers.normalizers import Sequence, Replace, Prepend

tokenizer_name = "mistralai/Mistral-7B-v0.1"
old_tok = AutoTokenizer.from_pretrained(tokenizer_name)

# The fast tokenizer for v0.1 ships with a normalizer we can swap out.
assert old_tok.backend_tokenizer.normalizer is not None

# Keep the stock SentencePiece-style steps (prepend "▁", map "▁" to space)
# and append two custom replacements.
new_normalizer = Sequence(
    [Prepend("▁"), Replace("▁", " "), Replace("foo", "bar"), Replace("<br>", "\n")]
)
old_tok.backend_tokenizer.normalizer = new_normalizer

new_tokenizer_name = f"new_tokenizer-{tokenizer_name}"
old_tok.save_pretrained(new_tokenizer_name)

# Reload both for a side-by-side comparison.
old_tok = AutoTokenizer.from_pretrained(tokenizer_name)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_name)
[out]:
>>> print(' '.join(old_tok.batch_decode(old_tok("I foo you<br>hello world")['input_ids'])))
<s> I foo you < br > hello world
>>> print(' '.join(new_tok.batch_decode(new_tok("I foo you<br>hello world")['input_ids'])))
<s> I bar you
hello world
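As a sanity check (a minimal sketch; the path below just follows from the f-string above, since tokenizer_name contains a slash), the modified normalizer is persisted in the saved tokenizer.json:

import json

# save_pretrained above wrote into "new_tokenizer-mistralai/Mistral-7B-v0.1"
with open("new_tokenizer-mistralai/Mistral-7B-v0.1/tokenizer.json") as f:
    config = json.load(f)

# Should show the Sequence of Prepend/Replace steps set above.
print(config["normalizer"])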
The same process above won't work for "mistralai/Mistral-7B-v0.3".
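A quick way to see where the two checkpoints diverge (a hedged sketch; I haven't dug into v0.3's internals beyond this) is to compare the backend normalizer and pre-tokenizer of each:

from transformers import AutoTokenizer

for name in ("mistralai/Mistral-7B-v0.1", "mistralai/Mistral-7B-v0.3"):
    tok = AutoTokenizer.from_pretrained(name)
    backend = tok.backend_tokenizer
    # If v0.3 reports None for the normalizer, the assert in the snippet
    # above fails before any replacement is ever applied.
    print(name, "normalizer:", backend.normalizer)
    print(name, "pre_tokenizer:", backend.pre_tokenizer)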