I’m trying to pretokenize input sentences using GPT2TokenizerFast and then feed the pretokenized input through the subword tokenizer.
I found that the fast tokenizers in the Hugging Face library expose the pretokenization step through:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(string)
This works to get the pretokenized input, and I thought I could just feed this back into the tokenizer with `is_split_into_words=True`, as follows:
tokenizer(pretokenized_input, return_tensors='pt', is_split_into_words=True).tokens()
but I get an unexpected result. After pretokenization of the sentence ‘The person went to the store’ I get:
[‘ĠThe’, ‘Ġperson’, ‘Ġwent’, ‘Ġto’, ‘Ġthe’, ‘Ġstore’]
But when I feed this through the tokenizer I get:
[‘ĠÄ’, ‘ł’, ‘The’, ‘ĠÄ’, ‘ł’, ‘person’, ‘ĠÄ’, ‘ł’, ‘went’, ‘ĠÄ’, ‘ł’, ‘to’, ‘ĠÄ’, ‘ł’, ‘the’, ‘ĠÄ’, ‘ł’, ‘store’]
Why does it add the tokens ‘ĠÄ’ and ‘ł’?