WordLevel Tokenization with GPT2?

I am thinking of pretraining a GPT2 model from scratch where the tokens must be whole words (e.g. using the WordLevel tokenizer) rather than subwords (e.g. using ByteLevelBPETokenizer), because subwords don’t make sense for my application.

Will the GPT2LMHeadModel and GPT2Tokenizer be able to accept word level tokenization?

I ask because a Tokenizer(WordLevel()) tokenizer saves to a single JSON file:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace so each token is a whole word
tokenizer.train(files=["words.txt"], trainer=WordLevelTrainer(special_tokens=["[UNK]"]))
tokenizer.save("wordlevel.json")

but the GPT2Tokenizer reads from two files (vocab.json and merges.txt) and is based on byte-level Byte-Pair Encoding:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer(
    vocab_file="vocab.json",
    merges_file="merges.txt",
)

If byte-level byte-pair encoding must be used, can ByteLevelBPETokenizer be configured to do word level tokenization?
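
For context, this is the kind of setup I am hoping is possible: wrap the saved wordlevel.json in a PreTrainedTokenizerFast and feed its ids to GPT2LMHeadModel. It is only a rough sketch, assuming the wordlevel.json from above and a made-up [PAD] token, and I have not verified that it actually trains well:

from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast

# Wrap the word-level tokenizer saved above (file name assumed from the snippet above)
tokenizer = PreTrainedTokenizerFast(tokenizer_file="wordlevel.json")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})  # hypothetical pad token for batching

# The model only ever sees token ids, so in principle it should not care
# whether those ids stand for whole words or subwords; the vocab size just
# has to match the tokenizer.
config = GPT2Config(vocab_size=len(tokenizer))
model = GPT2LMHeadModel(config)

ids = tokenizer("some words from words.txt", return_tensors="pt").input_ids
out = model(ids, labels=ids)  # untrained model, just checking that the shapes flow through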

Thanks 🙂

Hey Athena,

I also need my generated tokens to be whole words, and I have the same problem you had two years ago! Did you ever figure out a way to solve it? I know it has been a long time, but it would really help me out a lot!

Thanks in advance!