Hi there,
I have noticed that when training a GPT2 tokenizer loaded in the simplest way:
from transformers import AutoTokenizer

toke_base = AutoTokenizer.from_pretrained('gpt2', use_fast=True)
then by default it uses the regex used by the original GPT2 for pre-tokenizing (as expected). Unfortunately, for large datasets without spaces, that can create OOM issues.
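To make it concrete, the default pre-tokenization can be inspected like this (pre_tokenize_str just shows the splits; the commented output is roughly what I get):

# inspect the default (regex-based) pre-tokenization
print(toke_base.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello world"))
# -> roughly [('Hello', (0, 5)), ('Ġworld', (5, 11))]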
I would like to create a GPT2-like tokenizer that is identical except for this regex use (so to speak, something like use_regex=False). I could achieve that by doing:
from tokenizers.pre_tokenizers import ByteLevel
toke_base.backend_tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False, use_regex=False)
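After that change, the same check as above (assuming the pre_tokenize_str snippet) shows the input is no longer split by the regex:

# with use_regex=False the string is byte-level mapped but not split
print(toke_base.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello world"))
# -> roughly [('HelloĠworld', (0, 11))]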
However, I realise that I also need to adjust the post_processor and the decoder. That's where I run into trouble: I try changing those attributes, but when saving the tokenizer the option remains there ("use_regex": true under both "post_processor" and "decoder") inside tokenizer.json.
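For reference, the attempt looks roughly like this (I am not sure the ByteLevel post-processor and decoder constructors accept a use_regex argument at all, and the save directory name is just an example):

from tokenizers import processors, decoders

# swap in plain ByteLevel components for the post-processor and decoder
toke_base.backend_tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
toke_base.backend_tokenizer.decoder = decoders.ByteLevel()

# saving still writes "use_regex": true for both components in tokenizer.json
toke_base.save_pretrained("gpt2-no-regex")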
Does anyone know if this operation (just switching off use_regex for this tokenizer) is possible? Thanks in advance!