Hi,
The Tokenizer that comes with a pretrained model automatically splits natural words (delimited by whitespace) into word pieces. For example:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.convert_ids_to_tokens(tokenizer.encode("Laced with dreams-dripping in reality, the American Dream reignites after 9.11 with a true story about the Devil Ray's mid-life rookie , Jimmy Morris. "))
# bert-base-uncased
['[CLS]', 'laced', 'with', 'dreams', '-', 'dripping', 'in', 'reality', ',', 'the', 'american', 'dream', 'reign', '##ites', 'after', '9', '.', '11', 'with', 'a', 'true', 'story', 'about', 'the', 'devil', 'ray', "'", 's', 'mid', '-', 'life', 'rookie', ',', 'jimmy', 'morris', '.', '[SEP]']
# gpt2
['<|endoftext|>', 'L', 'aced', 'Ġwith', 'Ġdreams', 'Ġ-', 'Ġdripping', 'Ġin', 'Ġreality', ',', 'Ġthe', 'ĠAmerican', 'ĠDream', 'Ġreign', 'ites', 'Ġafter', 'Ġ9', '.', '11', 'Ġwith', 'Ġa', 'Ġtrue', 'Ġstory', 'Ġabout', 'Ġthe', 'ĠDevil', 'ĠRay', "'s", 'Ġmid', '-', 'life', 'Ġrookie', ',', 'ĠJimmy', 'ĠMorris', '.', '<|endoftext|>']
# xlnet-base-cased
['<cls>', '▁Lac', 'ed', '▁with', '▁dreams', '▁', '-', '▁dripping', '▁in', '▁reality', ',', '▁the', '▁American', '▁Dream', '▁reign', 'ites', '▁after', '▁9', '.', '11', '▁with', '▁a', '▁true', '▁story', '▁about', '▁the', '▁Devil', '▁Ray', "'", 's', '▁mid', '-', 'life', '▁rookie', ',', '▁Jimmy', '▁Morris', '.', '</s>']
# xlm-mlm-enfr-1024
['<s>', 'laced</w>', 'with</w>', 'dreams</w>', '-</w>', 'dri', 'pping</w>', 'in</w>', 'reality</w>', ',</w>', 'the</w>', 'americ', 'an</w>', 'dream</w>', 're', 'ign', 'ites</w>', 'after</w>', '9.', '11</w>', 'with</w>', 'a</w>', 'true</w>', 'story</w>', 'about</w>', 'the</w>', 'devil</w>', 'ray</w>', "'s</w>", 'mid</w>', '-</w>', 'life</w>', 'rookie</w>', ',</w>', 'j', 'im', 'my</w>', 'mor', 'ris</w>', '.</w>', '</s>']
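(For reference, the piece-level outputs above can be reproduced roughly like this; note that tokenize() by itself does not add the special tokens shown at the ends:)

from transformers import AutoTokenizer

sentence = ("Laced with dreams-dripping in reality, the American Dream "
            "reignites after 9.11 with a true story about the Devil Ray's "
            "mid-life rookie , Jimmy Morris. ")

# AutoTokenizer picks the matching tokenizer class for each model name.
for name in ["gpt2", "xlnet-base-cased", "xlm-mlm-enfr-1024"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(sentence))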
However, when I use a pretrained Transformer model and its Tokenizer, I want the sentence to be tokenized into linguistic words rather than word pieces, so that natural words are what enter the Transformer. The result I want is the following, and these natural words would then go into the Transformer model for further computation:
['Laced', 'with', 'dreams-dripping', 'in', 'reality', ',', 'the', 'American', 'Dream', 'reignites', 'after', '9.11', 'with', 'a', 'true', 'story', 'about', 'the', 'Devil', 'Ray', "'s", 'mid-life', 'rookie', ',', 'Jimmy', 'Morris', '.']
Is there any setting in the Tokenizer that would make this possible?
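If there is no such setting, would the usual workaround be to pre-split the sentence myself and then pool the word-piece vectors back into one vector per word? Here is a rough sketch of what I mean (assuming a fast tokenizer, so that is_split_into_words and word_ids() are available; the mean-pooling is just my assumption):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The word-level tokenization I want to start from (same as above).
words = ['Laced', 'with', 'dreams-dripping', 'in', 'reality', ',', 'the',
         'American', 'Dream', 'reignites', 'after', '9.11', 'with', 'a',
         'true', 'story', 'about', 'the', 'Devil', 'Ray', "'s", 'mid-life',
         'rookie', ',', 'Jimmy', 'Morris', '.']

# is_split_into_words=True keeps my word boundaries, but the tokenizer may
# still break each word into several word pieces internally.
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (num_pieces, hidden_size)

# word_ids() maps each piece position back to its word index (None for
# [CLS]/[SEP]), so the piece vectors can be mean-pooled per natural word.
word_ids = enc.word_ids(0)
word_vectors = torch.stack([
    hidden[[i for i, w in enumerate(word_ids) if w == wi]].mean(dim=0)
    for wi in range(len(words))
])
print(word_vectors.shape)  # torch.Size([27, 768]) -- one vector per word

I understand the model itself still has to see word pieces; the pooling above is just my guess at how to recover one vector per natural word afterwards.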
Many thanks!
Best, Kevin