For example:
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
>>> tokenizer.add_tokens(' ')
1
>>> tokenizer.encode('你好 世界', add_special_tokens=False)
[872, 1962, 21128, 686, 4518]
>>> tokenizer.encode(['你', '好', ' ', '世', '界'], is_split_into_words=True, add_special_tokens=False)
[872, 1962, 686, 4518]
Clearly, the blank token is ignored in the second call (with `is_split_into_words=True`), even though it is kept when encoding the raw string. But if I change it to a non-whitespace token like '[balabala]', it works.
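To illustrate the non-whitespace case, here is a minimal sketch of what I mean (the ID 21129 is simply what I would expect `add_tokens` to assign next, after 21128 above; treat both the return value and the IDs as illustrative, not verified output):
>>> tokenizer.add_tokens('[balabala]')  # hypothetical: appended after the blank token, so presumably ID 21129
1
>>> tokenizer.encode(['你', '好', '[balabala]', '世', '界'], is_split_into_words=True, add_special_tokens=False)
[872, 1962, 21129, 686, 4518]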
So what is the proper way to add a whitespace token to the tokenizer?