I am thinking of pretraining a GPT2 model from scratch where the tokens must be whole words (e.g. using the WordLevel tokenizer) instead of subwords (e.g. using the ByteLevelBPETokenizer), because subwords don't make sense for my application. Will GPT2LMHeadModel and GPT2Tokenizer be able to accept word-level tokenization?
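For context, the model side of what I have in mind is roughly this (just a sketch; the vocab_size of 20000 is a placeholder for whatever my word-level vocabulary ends up being):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# placeholder vocab_size; in practice it would match the word-level vocabulary
config = GPT2Config(vocab_size=20000)
model = GPT2LMHeadModel(config)
```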
My concern is that a Tokenizer(WordLevel()) tokenizer saves to a single JSON file:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# train a word-level vocabulary and save it as one JSON file
tokenizer = Tokenizer(WordLevel())
tokenizer.train(files=["words.txt"])
tokenizer.save("wordlevel.json")
```
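As far as I understand, everything about that tokenizer (model type, vocabulary, settings) round-trips through that one file, e.g.:

```python
from tokenizers import Tokenizer

# reload the word-level tokenizer from the single JSON file
wordlevel = Tokenizer.from_file("wordlevel.json")
print(wordlevel.get_vocab_size())
```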
GPT2Tokenizer, on the other hand, reads from two files and is based on byte-level Byte-Pair-Encoding:

```python
from transformers import GPT2Tokenizer

# GPT2Tokenizer needs both the BPE vocabulary and the merge rules
tokenizer = GPT2Tokenizer(
    vocab_file="vocab.json",
    merges_file="merges.txt",
)
```
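What I am hoping is possible is to wrap that single JSON file for use with transformers instead of going through GPT2Tokenizer, something like the following (this is a guess on my part, not something I have verified for GPT2 training):

```python
from transformers import PreTrainedTokenizerFast

# unverified idea: load the word-level tokenizer JSON as a fast tokenizer
# and use it in place of GPT2Tokenizer when feeding the model
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="wordlevel.json")
print(hf_tokenizer("hello world"))
```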
If byte-level byte-pair encoding must be used, can ByteLevelBPETokenizer be configured to do word-level tokenization?
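For reference, this is the byte-level BPE training flow that produces the vocab.json / merges.txt pair above, as far as I understand it (the vocab_size and min_frequency values here are just placeholders):

```python
from tokenizers import ByteLevelBPETokenizer

# train a byte-level BPE tokenizer and write out vocab.json and merges.txt
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["words.txt"], vocab_size=50257, min_frequency=2)
tokenizer.save_model(".")
```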
Thanks