Recently, I have been using bert-base-multilingual-cased for my work in Bengali. When I feed a sentence like "āĻāĻāĻā§ হবে না" into BertTokenizer, I get the following output.
Sentence: āĻāĻāĻā§ হবে না
Tokens: ['āĻ', '##āĻ', '##āĻā§', 'হবে', 'না']
To Int: [938, 24383, 18243, 53761, 26109]
But when I feed in a sentence like "āĻāĻāĻā§ হবেনা", with "না" not separated by a space, the token becomes "##না" and its corresponding index changes as well.
Sentence: āĻāĻāĻā§ হবেনা
Tokens: ['āĻ', '##āĻ', '##āĻā§', 'হবে', '##না']
To Int: [938, 24383, 18243, 53761, 20979]
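For reference, output like the above can be produced with something along these lines; this is only a minimal sketch, and only the last two words of the example sentence are shown here.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# spaced vs. unspaced variant of the end of the sentence
for sentence in ["হবে না", "হবেনা"]:
    tokens = tokenizer.tokenize(sentence)
    print(tokens, tokenizer.convert_tokens_to_ids(tokens))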
Now, I was hoping there is some way to let the tokenizer know that whenever it finds something like "##না", it should convert it to "না", in all such cases.
It might be easier to replace the না in the sentence with a space followed by না before you tokenize.
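For instance, a rough sketch of such a preprocessing step (the regex and helper name are only illustrative, and this naive rule would also split longer words that merely end in না, so it may need tightening):

import re

def add_space_before_na(text):
    # insert a space before a word-final "না" that is glued to the previous word,
    # e.g. "হবেনা" -> "হবে না"; note that words which merely end in "না"
    # would get split as well
    return re.sub(r"(\S)(না)\b", r"\1 \2", text)

print(add_space_before_na("হবেনা"))  # "হবে না"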
Is it just the ##না that is a problem, or do you want to get rid of all the ## continuation tokens?
Recently I realized it is not the case for every না.
And no, not for all ## continuation tokens, only for a few of them.
Can't you just replace the tokens before converting them to IDs?
# set of all tokens that should be replaced
NEED_REPL = {"##না"}

def retokenize(tokens):
    # strip the leading "##" from tokens that are in the replacement set
    return [t.replace("##", "", 1) if t in NEED_REPL else t for t in tokens]

tokens = ['āĻ', '##āĻ', '##āĻā§', 'হবে', '##না']
replaced = retokenize(tokens)
print(replaced)
# ['āĻ', '##āĻ', '##āĻā§', 'হবে', 'না']
Yes, but when I pass the whole dataset to the tokenizer, I have to do something like
encoding = self.tokenizer.encode_plus(
    reviews,
    add_special_tokens=True,
    max_length=self.max_len,
    return_token_type_ids=True,
    pad_to_max_length=True,
    return_attention_mask=True,
    return_tensors='pt',
)
where the encoding variable contains the input_ids and attention_mask for each sentence.
How could I override or work around the tokenizer function here?
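One possible workaround, as a rough sketch (the helper names below are mine, not part of the tokenizers API): keep the encode_plus call as it is and remap the offending IDs in the returned input_ids afterwards, looking the IDs up from the tokenizer instead of hard-coding them.

NEED_REPL = {"##না"}

def build_id_map(tokenizer, need_repl):
    # map e.g. id("##না") -> id("না"), skipping anything the vocab does not know
    id_map = {}
    for tok in need_repl:
        src = tokenizer.convert_tokens_to_ids(tok)
        dst = tokenizer.convert_tokens_to_ids(tok.replace("##", "", 1))
        if src != tokenizer.unk_token_id and dst != tokenizer.unk_token_id:
            id_map[src] = dst
    return id_map

def remap_input_ids(input_ids, id_map):
    # input_ids is the tensor returned by encode_plus(..., return_tensors='pt')
    for src, dst in id_map.items():
        input_ids[input_ids == src] = dst
    return input_ids

Then something like encoding['input_ids'] = remap_input_ids(encoding['input_ids'], build_id_map(self.tokenizer, NEED_REPL)) after the encode_plus call would leave the attention mask, padding and special tokens untouched.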