Hi everyone
I'm wondering why the pretrained 'bert-base-multilingual-cased' tokenizer does not remove accent markers.
To reproduce:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
tokenizer.convert_ids_to_tokens(tokenizer.encode('résumé'))
and the result gives
['[CLS]', 'r', '##és', '##um', '##é', '[SEP]']
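In case it is relevant, here is a minimal sketch of how I inspect which normalization the tokenizer is actually configured with (this assumes AutoTokenizer returns a fast tokenizer, so that backend_tokenizer exposes the underlying tokenizers.Tokenizer):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
# The repr of the backend normalizer should show the configured settings,
# e.g. whether lowercasing or accent stripping is enabled for this checkpoint.
print(tokenizer.backend_tokenizer.normalizer)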
However, the documentation says "When calling encode() or encode_batch(), the input text(s) go through the following pipeline: Normalization, Pre-Tokenization, The Model, Post-Processing" on this page: The tokenization pipeline – tokenizers documentation. Normalization includes stripping whitespace, removing accented characters, or lowercasing all text, which is also mentioned in the BERT GitHub README: bert/multilingual.md at master · google-research/bert · GitHub.
According to the Hugging Face tokenizers introduction, we can run the normalizer on its own as follows:
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
normalizer = normalizers.Sequence([NFD(), StripAccents()])
normalizer.normalize_str('résumé')
and the result gives:
'resume'
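For comparison, the transformers BERT tokenizers also take a strip_accents argument (if it is left unset, my understanding is that it follows the lowercasing setting, as in the original BERT), so accent stripping can be requested explicitly; the snippet below is a sketch based on that assumption:

from transformers import BertTokenizerFast
# strip_accents defaults to None, in which case it follows do_lower_case;
# passing True should force accent removal even for this cased model.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-cased',
                                              strip_accents=True)
tokenizer.convert_ids_to_tokens(tokenizer.encode('résumé'))
# Under this assumption I would expect accent-free subwords here,
# e.g. pieces of 'resume' instead of '##és' and '##é'.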
Thus, it seems to me that the pretrained tokenizer does not perform accent removal, even though the documented pipeline includes it. What's happening here? I'm quite confused.
Any help clarifying this would be appreciated.