Disable regex use when training a new GPT2 Tokenizer

Hi there,

I have noticed that when I load a GPT2 tokenizer the straightforward way, in order to train a new one from it:

from transformers import AutoTokenizer

toke_base = AutoTokenizer.from_pretrained('gpt2', use_fast=True)

then it uses the original GPT2 regex for pre-tokenization by default (as expected). Unfortunately, on large datasets without spaces, that regex can cause out-of-memory (OOM) issues.
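For illustration, the regex splitting is easy to observe on the backend tokenizer (the sample string and the offsets in the comment are just my own quick check):

# The default GPT2 pre-tokenizer splits the input using the original regex
print(toke_base.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello world!"))
# e.g. [('Hello', (0, 5)), ('Ġworld', (5, 11)), ('!', (11, 12))]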

I would like to create a GPT2-like tokenizer that is identical in every respect except for this regex use (something like use_regex=False, so to speak). For the pre-tokenizer, I can achieve that by doing:

from tokenizers.pre_tokenizers import ByteLevel

# Replace the pre-tokenizer with a ByteLevel one that skips the regex split
toke_base.backend_tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False, use_regex=False)
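With that change, the whole input stays as a single byte-level span (again, my own quick check):

print(toke_base.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello world!"))
# e.g. [('HelloĠworld!', (0, 12))], i.e. no regex splitting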

However, I realise that I also need to adjust the post_processor and the decoder, and that is where I run into trouble: I can reassign those attributes, but when I save the tokenizer, "use_regex": true remains set for both "post_processor" and "decoder" inside tokenizer.json.
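Concretely, this is roughly what I tried (as far as I can tell, neither constructor exposes a use_regex argument; the output directory name is just an example):

from tokenizers import decoders, processors

# Swap in fresh ByteLevel components; their constructors take no use_regex flag
toke_base.backend_tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
toke_base.backend_tokenizer.decoder = decoders.ByteLevel()

toke_base.save_pretrained('gpt2-no-regex')
# The saved tokenizer.json still shows "use_regex": true for both components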

The only fallback I can think of is patching the saved tokenizer.json by hand (sketch below), which feels hacky. Does anyone know if this operation (just switching off use_regex everywhere for this tokenizer) exists in a supported way? Thanks in advance!
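For reference, the hand-patching I have in mind (a sketch; the directory name matches the example above):

import json

# Flip the serialized flag on the ByteLevel post-processor and decoder by hand
with open('gpt2-no-regex/tokenizer.json') as f:
    tok = json.load(f)

for key in ('post_processor', 'decoder'):
    if (tok.get(key) or {}).get('type') == 'ByteLevel':
        tok[key]['use_regex'] = False

with open('gpt2-no-regex/tokenizer.json', 'w') as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)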