Hi,
The Tokenizer that comes with a pretrained model automatically splits natural words (delimited by whitespace) into word pieces. For example:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.convert_ids_to_tokens(tokenizer.encode("Laced with dreams-dripping in reality, the American Dream reignites after 9.11 with a true story about the Devil Ray's mid-life rookie , Jimmy Morris. "))
# bert-base-uncased
['[CLS]', 'laced', 'with', 'dreams', '-', 'dripping', 'in', 'reality', ',', 'the', 'american', 'dream', 'reign', '##ites', 'after', '9', '.', '11', 'with', 'a', 'true', 'story', 'about', 'the', 'devil', 'ray', "'", 's', 'mid', '-', 'life', 'rookie', ',', 'jimmy', 'morris', '.', '[SEP]']
# gpt2
['<|endoftext|>', 'L', 'aced', 'Ġwith', 'Ġdreams', 'Ġ-', 'Ġdripping', 'Ġin', 'Ġreality', ',', 'Ġthe', 'ĠAmerican', 'ĠDream', 'Ġreign', 'ites', 'Ġafter', 'Ġ9', '.', '11', 'Ġwith', 'Ġa', 'Ġtrue', 'Ġstory', 'Ġabout', 'Ġthe', 'ĠDevil', 'ĠRay', "'s", 'Ġmid', '-', 'life', 'Ġrookie', ',', 'ĠJimmy', 'ĠMorris', '.', '<|endoftext|>']
# xlnet-base-cased
['<cls>', '▁Lac', 'ed', '▁with', '▁dreams', '▁', '-', '▁dripping', '▁in', '▁reality', ',', '▁the', '▁American', '▁Dream', '▁reign', 'ites', '▁after', '▁9', '.', '11', '▁with', '▁a', '▁true', '▁story', '▁about', '▁the', '▁Devil', '▁Ray', "'", 's', '▁mid', '-', 'life', '▁rookie', ',', '▁Jimmy', '▁Morris', '.', '</s>']
# xlm-mlm-enfr-1024
['<s>', 'laced</w>', 'with</w>', 'dreams</w>', '-</w>', 'dri', 'pping</w>', 'in</w>', 'reality</w>', ',</w>', 'the</w>', 'americ', 'an</w>', 'dream</w>', 're', 'ign', 'ites</w>', 'after</w>', '9.', '11</w>', 'with</w>', 'a</w>', 'true</w>', 'story</w>', 'about</w>', 'the</w>', 'devil</w>', 'ray</w>', "'s</w>", 'mid</w>', '-</w>', 'life</w>', 'rookie</w>', ',</w>', 'j', 'im', 'my</w>', 'mor', 'ris</w>', '.</w>', '</s>']
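(For reference, the piece-level outputs above can be reproduced roughly like this; note that tokenize() by itself does not add the special tokens shown at the ends:)

from transformers import AutoTokenizer

sentence = ("Laced with dreams-dripping in reality, the American Dream "
            "reignites after 9.11 with a true story about the Devil Ray's "
            "mid-life rookie , Jimmy Morris. ")

# AutoTokenizer picks the matching tokenizer class for each model name.
for name in ["gpt2", "xlnet-base-cased", "xlm-mlm-enfr-1024"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(sentence))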
However, when I use a pretrained Transformer model and its Tokenizer, I want the sentence to be tokenized into linguistic words rather than word pieces, so that natural words are what enter the Transformer. The result I want is the following, and these natural words would then go into the Transformer model for further computation:
['Laced', 'with', 'dreams-dripping', 'in', 'reality', ',', 'the', 'American', 'Dream', 'reignites', 'after', '9.11', 'with', 'a', 'true', 'story', 'about', 'the', 'Devil', 'Ray', "'s", 'mid-life', 'rookie', ',', 'Jimmy', 'Morris', '.']
Is there any setting in the Tokenizer that would make this possible?
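If there is no such setting, would the usual workaround be to pre-split the sentence myself and then pool the word-piece vectors back into one vector per word? Here is a rough sketch of what I mean (assuming a fast tokenizer, so that is_split_into_words and word_ids() are available; the mean-pooling is just my assumption):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The word-level tokenization I want to start from (same as above).
words = ['Laced', 'with', 'dreams-dripping', 'in', 'reality', ',', 'the',
         'American', 'Dream', 'reignites', 'after', '9.11', 'with', 'a',
         'true', 'story', 'about', 'the', 'Devil', 'Ray', "'s", 'mid-life',
         'rookie', ',', 'Jimmy', 'Morris', '.']

# is_split_into_words=True keeps my word boundaries, but the tokenizer may
# still break each word into several word pieces internally.
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (num_pieces, hidden_size)

# word_ids() maps each piece position back to its word index (None for
# [CLS]/[SEP]), so the piece vectors can be mean-pooled per natural word.
word_ids = enc.word_ids(0)
word_vectors = torch.stack([
    hidden[[i for i, w in enumerate(word_ids) if w == wi]].mean(dim=0)
    for wi in range(len(words))
])
print(word_vectors.shape)  # torch.Size([27, 768]) -- one vector per word

I understand the model itself still has to see word pieces; the pooling above is just my guess at how to recover one vector per natural word afterwards.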
Many thanks!
Best, Kevin