Hi there,
I have noticed that when training a GPT2 tokenizer loaded in the simplest way:
from transformers import AutoTokenizer

toke_base = AutoTokenizer.from_pretrained('gpt2', use_fast=True)
then by default it uses the regex used by the original GPT2 for pre-tokenizing (as expected). Unfortunately, for large datasets without spaces, that can create OOM issues.
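To make it concrete, the default pre-tokenization can be inspected like this (pre_tokenize_str just shows the splits; the commented output is roughly what I get):

# inspect the default (regex-based) pre-tokenization
print(toke_base.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello world"))
# -> roughly [('Hello', (0, 5)), ('Ġworld', (5, 11))]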
I would like to create a GPT2-like tokenizer that is identical except for this regex use (so to speak, something like use_regex=False). I could achieve that by doing:
from tokenizers.pre_tokenizers import ByteLevel
toke_base.backend_tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False, use_regex=False)
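After that change, the same check as above (assuming the pre_tokenize_str snippet) shows the input is no longer split by the regex:

# with use_regex=False the string is byte-level mapped but not split
print(toke_base.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello world"))
# -> roughly [('HelloĠworld', (0, 11))]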
However, I realise that I also need to adjust the post_processor and the decoder. That's where I run into trouble: I try changing those attributes, but when saving the tokenizer the option remains there ("use_regex": true under both "post_processor" and "decoder") inside tokenizer.json.
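For reference, the attempt looks roughly like this (I am not sure the ByteLevel post-processor and decoder constructors accept a use_regex argument at all, and the save directory name is just an example):

from tokenizers import processors, decoders

# swap in plain ByteLevel components for the post-processor and decoder
toke_base.backend_tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
toke_base.backend_tokenizer.decoder = decoders.ByteLevel()

# saving still writes "use_regex": true for both components in tokenizer.json
toke_base.save_pretrained("gpt2-no-regex")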
Does anyone know if this operation (just switching off use_regex for this tokenizer) is possible? Thanks in advance!