How to make the tokenizer convert a subword token to an independent token?

Recently, I have been using bert-base-multilingual-cased for my work in Bengali. When I feed a sentence like "আজকে হবে না" to BertTokenizer, I get the following output.

Sentence:  আজকে হবে না
Tokens:  ['আ', '##জ', '##কে', 'হবে', 'না']
To Int:  [938, 24383, 18243, 53761, 26109]

But when I feed in a sentence like "আজকে হবেনা", with "না" not separated by a space, I see the token becomes "##না", and the corresponding index also changes.

Sentence:  আজকে হবেনা
Tokens:  ['আ', '##জ', '##কে', 'হবে', '##না']
To Int:  [938, 24383, 18243, 53761, 20979]

Now, I was hoping there is some way to let the tokenizer know that whenever it finds something like '##না', it should convert it to 'না'.

It might be easier to replace the না in the sentence with " না" (a space followed by না) before you tokenize.
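
For instance, a rough text-level preprocessing step could look like the sketch below (presplit_na and the regex are only an illustration of the idea, not something from this thread, and the rule is deliberately blunt):

import re

# Insert a space before a "না" that is glued to the end of the previous
# word, so the tokenizer sees it as a separate word. This is a blunt rule:
# it would also split words that merely end in না, so in practice you would
# restrict it to the cases you actually care about.
def presplit_na(text):
    return re.sub(r"(?<=\S)না\b", " না", text)

print(presplit_na("আজকে হবেনা"))
# আজকে হবে না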

Is it just the ##না that is a problem, or do you want to get rid of all the ## continuation tokens?

Recently I realized that it is not the case for all না.

And no, not for all ## continuation tokens, only for a few.

Can’t you just replace the tokens before converting them to IDs?

# set of all tokens that should be replaced
NEED_REPL = {"##না"}

def retokenize(tokens):
    return [t.replace("##", "", 1) if t in NEED_REPL else t for t in tokens]


tokens = ['আ', '##জ', '##কে', 'হবে', '##না']
replaced = retokenize(tokens)
print(replaced)
# ['আ', '##জ', '##কে', 'হবে', 'না']
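
If you need ids rather than token strings, the replaced list can be mapped back with convert_tokens_to_ids. A small sketch, assuming the standalone form exists in the vocabulary (which the first example above suggests):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# map the replaced token strings back to vocabulary ids
ids = tokenizer.convert_tokens_to_ids(replaced)
print(ids)
# the last id should now be the one for the standalone 'না'
# (26109 in the first example) rather than '##না' (20979)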

Yes, but when I pass the whole dataset to the tokenizer, I have to do something like

    encoding = self.tokenizer.encode_plus(
        reviews,
        add_special_tokens=True,
        max_length=self.max_len,
        return_token_type_ids=True,
        pad_to_max_length=True,
        return_attention_mask=True,
        return_tensors='pt',
    )

where the encoding variable contains the input_ids and attention_mask for each sentence.

How could I override or work around the tokenizer function?
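
One possible direction, sketched below under the assumption that your transformers version exposes tokenize, convert_tokens_to_ids and prepare_for_model with these options, is to rebuild what encode_plus does in steps and slip the token replacement in between. encode_with_replacement is a hypothetical wrapper (not an existing API), and reviews / self.max_len / retokenize refer to the snippets above:

def encode_with_replacement(tokenizer, text, max_len):
    tokens = tokenizer.tokenize(text)              # text -> wordpiece tokens
    tokens = retokenize(tokens)                    # '##না' -> 'না', etc.
    ids = tokenizer.convert_tokens_to_ids(tokens)
    # prepare_for_model adds special tokens, pads/truncates and builds the
    # attention mask, much like encode_plus does
    return tokenizer.prepare_for_model(
        ids,
        add_special_tokens=True,
        max_length=max_len,
        padding='max_length',
        truncation=True,
        return_token_type_ids=True,
        return_attention_mask=True,
        return_tensors='pt',
        prepend_batch_axis=True,
    )

# encoding = encode_with_replacement(self.tokenizer, reviews, self.max_len)
# encoding['input_ids'] and encoding['attention_mask'] can then be used as before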