How to make the tokenizer convert a subword token into an independent token?

Recently, I have been using bert-base-multilingual-cased for my work in Bengali. When I feed a sentence like "āĻ†āĻœāĻ•ā§‡ āĻšāĻŦā§‡ āĻ¨āĻž" to BertTokenizer, I get the following output.

Sentence:  āĻ†āĻœāĻ•ā§‡ āĻšāĻŦā§‡ āĻ¨āĻž
Tokens:  ['āĻ†', '##āĻœ', '##āĻ•ā§‡', 'āĻšāĻŦā§‡', 'āĻ¨āĻž']
To Int:  [938, 24383, 18243, 53761, 26109]

But when I feed in a sentence like "āĻ†āĻœāĻ•ā§‡ āĻšāĻŦā§‡āĻ¨āĻž", where "āĻ¨āĻž" is not separated by a space, the token becomes '##āĻ¨āĻž' and its index changes accordingly.

Sentence:  āĻ†āĻœāĻ•ā§‡ āĻšāĻŦā§‡āĻ¨āĻž
Tokens:  ['āĻ†', '##āĻœ', '##āĻ•ā§‡', 'āĻšāĻŦā§‡', '##āĻ¨āĻž']
To Int:  [938, 24383, 18243, 53761, 20979]
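
(For reference, output like the above can be reproduced roughly as follows; this is only a sketch, assuming the Hugging Face transformers library and the bert-base-multilingual-cased checkpoint.)

from transformers import BertTokenizer

# load the multilingual cased checkpoint mentioned above
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

for sentence in ["āĻ†āĻœāĻ•ā§‡ āĻšāĻŦā§‡ āĻ¨āĻž", "āĻ†āĻœāĻ•ā§‡ āĻšāĻŦā§‡āĻ¨āĻž"]:
    tokens = tokenizer.tokenize(sentence)          # WordPiece tokens; '##' marks continuations
    ids = tokenizer.convert_tokens_to_ids(tokens)  # vocabulary indices for those tokens
    print(sentence, tokens, ids)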

Now, I was hoping there is some way to let the tokenizer know that whenever it finds something like '##āĻ¨āĻž', it should convert it to 'āĻ¨āĻž', for all such cases.

It might be easier to replace āĻ¨āĻž in the sentence with " āĻ¨āĻž" (with a leading space) before you tokenize.
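
For example, a minimal preprocessing sketch along those lines (separate_trailing_na is a hypothetical helper, and the regex is a crude heuristic that will also split words that legitimately end in āĻ¨āĻž):

import re

def separate_trailing_na(text):
    # insert a space before a word-final "āĻ¨āĻž" that is glued to the previous word
    return re.sub(r"(?<=\S)āĻ¨āĻž(?=\s|$)", " āĻ¨āĻž", text)

print(separate_trailing_na("āĻ†āĻœāĻ•ā§‡ āĻšāĻŦā§‡āĻ¨āĻž"))
# āĻ†āĻœāĻ•ā§‡ āĻšāĻŦā§‡ āĻ¨āĻž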

Is it just the ##āĻ¨āĻž that is a problem, or do you want to get rid of all the ## continuation tokens?

Recently I realized it is not for every āĻ¨āĻž.

And no, not for all ## continuation tokens, only for a few.

Can't you just replace the tokens before converting them to IDs?

# set of all tokens that should be replaced
NEED_REPL = {"##āĻ¨āĻž"}

def retokenize(tokens):
    return [t.replace("##", "", 1) if t in NEED_REPL else t for t in tokens]


tokens = ['āĻ†', '##āĻœ', '##āĻ•ā§‡', 'āĻšāĻŦā§‡', '##āĻ¨āĻž']
replaced = retokenize(tokens)
print(replaced)
# ['āĻ†', '##āĻœ', '##āĻ•ā§‡', 'āĻšāĻŦā§‡', 'āĻ¨āĻž']

Yes, but when I pass the whole dataset to the tokenizer, I have to do something like

encoding = self.tokenizer.encode_plus(
    reviews,
    add_special_tokens=True,
    max_length=self.max_len,
    return_token_type_ids=True,
    pad_to_max_length=True,
    return_attention_mask=True,
    return_tensors='pt',
)

where the encoding variable contains the input_ids and attention_mask for each sentence.

How could I override or work around the tokenizer's encoding function?
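
One possible work-around (just a sketch, and the exact arguments depend on your transformers version) is to skip encode_plus for these sentences: tokenize manually, apply the retokenize replacement from above, and rebuild the encoding with prepare_for_model, which accepts already-converted IDs and supports the same special-token, padding, and truncation options. encode_with_replacement is a hypothetical helper name.

def encode_with_replacement(self, review):
    # tokenize first so the '##āĻ¨āĻž' -> 'āĻ¨āĻž' replacement can be applied to the tokens
    tokens = retokenize(self.tokenizer.tokenize(review))
    ids = self.tokenizer.convert_tokens_to_ids(tokens)
    # rebuild an encoding equivalent to the encode_plus output (input_ids, attention_mask, ...)
    return self.tokenizer.prepare_for_model(
        ids,
        add_special_tokens=True,
        max_length=self.max_len,
        padding='max_length',      # on older transformers versions: pad_to_max_length=True
        truncation=True,
        return_token_type_ids=True,
        return_attention_mask=True,
        return_tensors='pt',
        prepend_batch_axis=True,   # so tensor shapes match encode_plus
    )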