How to tokenize text into natural words using the Tokenizer from pretrained Transformer models

Hi,

The Tokenizer of a pretrained model automatically splits natural words (delimited by whitespace) into word pieces. For example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("Laced with dreams-dripping in reality,  the American Dream reignites after 9.11 with a true story about the Devil Ray's mid-life rookie , Jimmy Morris. ")
# bert-base-uncased
['[CLS]', 'laced', 'with', 'dreams', '-', 'dripping', 'in', 'reality', ',', 'the', 'american', 'dream', 'reign', '##ites', 'after', '9', '.', '11', 'with', 'a', 'true', 'story', 'about', 'the', 'devil', 'ray', "'", 's', 'mid', '-', 'life', 'rookie', ',', 'jimmy', 'morris', '.', '[SEP]']
# gpt2
['<|endoftext|>', 'L', 'aced', 'Ġwith', 'Ġdreams', 'Ġ-', 'Ġdripping', 'Ġin', 'Ġreality', ',', 'Ġthe', 'ĠAmerican', 'ĠDream', 'Ġreign', 'ites', 'Ġafter', 'Ġ9', '.', '11', 'Ġwith', 'Ġa', 'Ġtrue', 'Ġstory', 'Ġabout', 'Ġthe', 'ĠDevil', 'ĠRay', "'s", 'Ġmid', '-', 'life', 'Ġrookie', ',', 'ĠJimmy', 'ĠMorris', '.', '<|endoftext|>']
# xlnet-base-cased
['<cls>', '▁Lac', 'ed', '▁with', '▁dreams', '▁', '-', '▁dripping', '▁in', '▁reality', ',', '▁the', '▁American', '▁Dream', '▁reign', 'ites', '▁after', '▁9', '.', '11', '▁with', '▁a', '▁true', '▁story', '▁about', '▁the', '▁Devil', '▁Ray', "'", 's', '▁mid', '-', 'life', '▁rookie', ',', '▁Jimmy', '▁Morris', '.', '</s>']
# xlm-mlm-enfr-1024
['<s>', 'laced</w>', 'with</w>', 'dreams</w>', '-</w>', 'dri', 'pping</w>', 'in</w>', 'reality</w>', ',</w>', 'the</w>', 'americ', 'an</w>', 'dream</w>', 're', 'ign', 'ites</w>', 'after</w>', '9.', '11</w>', 'with</w>', 'a</w>', 'true</w>', 'story</w>', 'about</w>', 'the</w>', 'devil</w>', 'ray</w>', "'s</w>", 'mid</w>', '-</w>', 'life</w>', 'rookie</w>', ',</w>', 'j', 'im', 'my</w>', 'mor', 'ris</w>', '.</w>', '</s>']



However, I want to tokenize the sentence into linguistic words rather than word pieces, even when a pretrained Transformer model and its Tokenizer are used. In other words, I want to feed natural words into the Transformer.

The result I want is shown below; these natural words should then be fed into the Transformer model for further computation.

['Laced', 'with', 'dreams-dripping', 'in', 'reality', ',', 'the', 'American', 'Dream', 'reignites', 'after', '9.11', 'with', 'a', 'true', 'story', 'about', 'the', 'Devil', 'Ray', "'s", 'mid-life', 'rookie', ',', 'Jimmy', 'Morris', '.']
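
To make the target concrete, here is a minimal sketch of a word-level split that does not use any pretrained Tokenizer. It assumes a hand-written regular expression (the pattern and the name `split_words` are my own illustration, not part of the transformers library): a possessive 's is its own token, hyphens and periods inside a word are kept, and any other punctuation mark is split off.

```python
import re

# Hypothetical word-level pattern (an assumption for illustration only):
#   's\b               a possessive clitic as its own token
#   \w+(?:[-.]\w+)*    a word, keeping inner hyphens/periods ("mid-life", "9.11")
#   [^\w\s]            any other single punctuation mark
WORD_PATTERN = re.compile(r"'s\b|\w+(?:[-.]\w+)*|[^\w\s]")


def split_words(text):
    """Split text into linguistic words and punctuation marks."""
    return WORD_PATTERN.findall(text)


sentence = (
    "Laced with dreams-dripping in reality,  the American Dream reignites "
    "after 9.11 with a true story about the Devil Ray's mid-life rookie , "
    "Jimmy Morris. "
)
words = split_words(sentence)
print(words)
```

For this particular sentence the pattern yields the word list above; other inputs (contractions such as "don't", abbreviations, URLs) would likely need a more careful rule set or a dedicated word tokenizer.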

How can I configure the Tokenizer to achieve this?
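
One direction I could imagine, sketched here under assumptions: since the pretrained vocabularies are built from word pieces, keep the word-piece Tokenizer but apply it to one word at a time and record which pieces belong to which word, so that piece-level outputs can later be pooled back to word level. In the sketch below, `align_pieces` and `toy_tokenize` are hypothetical names; `toy_tokenize` is only a stand-in so that the example runs without downloading a model, and with the real library one would pass `tokenizer.tokenize` instead.

```python
from typing import Callable, List, Tuple


def align_pieces(
    words: List[str],
    tokenize_word: Callable[[str], List[str]],
) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Tokenize each word separately and remember its span of pieces.

    Returns the flat list of pieces and, for every word, a half-open
    (start, end) index range into that list.
    """
    pieces: List[str] = []
    spans: List[Tuple[int, int]] = []
    for word in words:
        start = len(pieces)
        pieces.extend(tokenize_word(word))
        spans.append((start, len(pieces)))
    return pieces, spans


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in for a real word-piece tokenizer.

    Cuts a word into chunks of at most 5 characters and marks
    continuation chunks with '##', mimicking the BERT convention.
    """
    if not word:
        return []
    chunks = [word[i:i + 5] for i in range(0, len(word), 5)]
    return [chunks[0]] + ["##" + chunk for chunk in chunks[1:]]


pieces, spans = align_pieces(["reignites", "after", "9.11"], toy_tokenize)
for word_index, (start, end) in enumerate(spans):
    print(word_index, pieces[start:end])
```

With such spans, the hidden states of the pieces inside each (start, end) range could be averaged (or the first piece taken) to obtain one vector per natural word. Whether this is the intended usage is exactly what I am unsure about.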

Many thanks!

Best, Kevin