I have just noticed that tokenized German text looks rather strange compared to English tokenized with the same settings (as shown below). For English I get one token per word, but for German I get more tokens than words. Why is that?
I also noticed the strip_accents setting. Is this available for the encode_plus method as well, or do I have to use the direct tokenizer API for that? If the latter, what would be an equivalent way of implementing the settings below with strip_accents toggled so that accents are not removed? (My guess at this is sketched at the end of the post.)
Thanx Dirk
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')

tokens = tokenizer.encode_plus('schön gut',
                               max_length=10,
                               truncation=True,
                               padding='max_length',
                               add_special_tokens=True,
                               return_token_type_ids=False,
                               return_attention_mask=True,
                               return_tensors='tf')
print(tokens)
{'input_ids': <tf.Tensor: shape=(1, 10), dtype=int32, numpy=
array([[  101,   187, 10269, 15248, 23639,   102,     0,     0,     0,
            0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 10), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]], dtype=int32)>}
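
To see what the extra ids correspond to, I assume I can just look at the word pieces directly, e.g. with something like this (convert_ids_to_tokens and tokenize used as I understand them from the docs):

ids = [int(i) for i in tokens['input_ids'].numpy()[0]]
print(tokenizer.convert_ids_to_tokens(ids))   # special tokens, word pieces and padding
print(tokenizer.tokenize('schön gut'))        # just the word pieces

And for comparison, the English version with the same settings: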
tokens = tokenizer.encode_plus('good day',
                               max_length=10,
                               truncation=True,
                               padding='max_length',
                               add_special_tokens=True,
                               return_token_type_ids=False,
                               return_attention_mask=True,
                               return_tensors='tf')
print(tokens)
{'input_ids': <tf.Tensor: shape=(1, 10), dtype=int32, numpy=
array([[  101, 15198, 11940,   102,     0,     0,     0,     0,     0,
            0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 10), dtype=int32, numpy=array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
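
Regarding strip_accents: my guess (an assumption, not something I have verified) is that it is an option of the tokenizer itself rather than of encode_plus, so it would be set when loading, roughly like this:

tokenizer = DistilBertTokenizer.from_pretrained(
    'distilbert-base-multilingual-cased',
    strip_accents=False,  # assumption: keep accents such as the ö in schön
)
tokens = tokenizer.encode_plus('schön gut',
                               max_length=10,
                               truncation=True,
                               padding='max_length',
                               add_special_tokens=True,
                               return_token_type_ids=False,
                               return_attention_mask=True,
                               return_tensors='tf')
print(tokens)

Is that the right way to do it, or is there an encode_plus-level equivalent?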