I have just noticed that tokenized German text looks rather strange compared to English tokenized with the same settings (as shown below). For English I get one token per word, but for German I get more tokens than words. Why is that?
I also noticed the strip_accents setting. Is this available for the encode_plus method as well, or do I have to use the direct tokenizer API for that? If the latter, what would be an equivalent way of implementing the settings below with strip_accents toggled so that accents are not removed? (My guess at this is sketched at the end of the post.)
Thanx Dirk
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')

tokens = tokenizer.encode_plus('schön gut',
                               max_length=10,
                               truncation=True,
                               padding='max_length',
                               add_special_tokens=True,
                               return_token_type_ids=False,
                               return_attention_mask=True,
                               return_tensors='tf')
print(tokens)
{'input_ids': <tf.Tensor: shape=(1, 10), dtype=int32, numpy=
array([[  101,   187, 10269, 15248, 23639,   102,     0,     0,     0,
            0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 10), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]], dtype=int32)>}
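
To see what the extra ids correspond to, I assume I can just look at the word pieces directly, e.g. with something like this (convert_ids_to_tokens and tokenize used as I understand them from the docs):

ids = [int(i) for i in tokens['input_ids'].numpy()[0]]
print(tokenizer.convert_ids_to_tokens(ids))   # special tokens, word pieces and padding
print(tokenizer.tokenize('schön gut'))        # just the word pieces

And for comparison, the English version with the same settings: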
tokens = tokenizer.encode_plus('good day',
                               max_length=10,
                               truncation=True,
                               padding='max_length',
                               add_special_tokens=True,
                               return_token_type_ids=False,
                               return_attention_mask=True,
                               return_tensors='tf')
print(tokens)
{'input_ids': <tf.Tensor: shape=(1, 10), dtype=int32, numpy=
array([[  101, 15198, 11940,   102,     0,     0,     0,     0,     0,
            0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 10), dtype=int32, numpy=array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
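
Regarding strip_accents: my guess (an assumption, not something I have verified) is that it is an option of the tokenizer itself rather than of encode_plus, so it would be set when loading, roughly like this:

tokenizer = DistilBertTokenizer.from_pretrained(
    'distilbert-base-multilingual-cased',
    strip_accents=False,  # assumption: keep accents such as the ö in schön
)
tokens = tokenizer.encode_plus('schön gut',
                               max_length=10,
                               truncation=True,
                               padding='max_length',
                               add_special_tokens=True,
                               return_token_type_ids=False,
                               return_attention_mask=True,
                               return_tensors='tf')
print(tokens)

Is that the right way to do it, or is there an encode_plus-level equivalent?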