Hi everyone
I'm wondering why the pretrained 'bert-base-multilingual-cased' tokenizer does not remove accent markers.
To reproduce:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
tokenizer.convert_ids_to_tokens(tokenizer.encode('résumé'))
and the result gives
['[CLS]', 'r', '##és', '##um', '##é', '[SEP]']
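In case it is relevant, here is a minimal sketch of how I inspect which normalization the tokenizer is actually configured with (this assumes AutoTokenizer returns a fast tokenizer, so that backend_tokenizer exposes the underlying tokenizers.Tokenizer):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
# The repr of the backend normalizer should show the configured settings,
# e.g. whether lowercasing or accent stripping is enabled for this checkpoint.
print(tokenizer.backend_tokenizer.normalizer)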
However, the documentation says "When calling encode() or encode_batch(), the input text(s) go through the following pipeline: Normalization, Pre-Tokenization, The Model, Post-Processing" on this page: The tokenization pipeline – tokenizers documentation. Normalization includes stripping whitespace, removing accented characters, or lowercasing all text, which is also mentioned in the BERT GitHub README: bert/multilingual.md at master · google-research/bert · GitHub.
According to the Hugging Face tokenizers introduction, we can run the normalizer on its own as follows:
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
normalizer = normalizers.Sequence([NFD(), StripAccents()])
normalizer.normalize_str('résumé')
and the result gives:
'resume'
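For comparison, the transformers BERT tokenizers also take a strip_accents argument (if it is left unset, my understanding is that it follows the lowercasing setting, as in the original BERT), so accent stripping can be requested explicitly; the snippet below is a sketch based on that assumption:

from transformers import BertTokenizerFast
# strip_accents defaults to None, in which case it follows do_lower_case;
# passing True should force accent removal even for this cased model.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-cased',
                                              strip_accents=True)
tokenizer.convert_ids_to_tokens(tokenizer.encode('résumé'))
# Under this assumption I would expect accent-free subwords here,
# e.g. pieces of 'resume' instead of '##és' and '##é'.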
Thus, it seems to me that the pretrained tokenizer does not perform accent removal, even though the documented pipeline includes it. What's happening here? I'm quite confused.
Any help clarifying this would be appreciated.