I’m trying to pretokenize input sentences using GPT2TokenizerFast and then feed the pretokenized input through the subword tokenizer.
I found that the fast tokenizers in the Hugging Face library expose the pretokenization step through:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(string)
This works to get the pretokenized input, and I thought I could just feed this back into the tokenizer with `is_split_into_words=True`, as follows:
tokenizer(pretokenized_input, return_tensors='pt', is_split_into_words=True).tokens()
but I get an unexpected result. After pretokenization of the sentence ‘The person went to the store’ I get:
[‘ĠThe’, ‘Ġperson’, ‘Ġwent’, ‘Ġto’, ‘Ġthe’, ‘Ġstore’]
But when I feed this through the tokenizer I get:
[‘ĠÄ’, ‘ł’, ‘The’, ‘ĠÄ’, ‘ł’, ‘person’, ‘ĠÄ’, ‘ł’, ‘went’, ‘ĠÄ’, ‘ł’, ‘to’, ‘ĠÄ’, ‘ł’, ‘the’, ‘ĠÄ’, ‘ł’, ‘store’]
Why does it add the tokens ‘ĠÄ’ and ‘ł’?