How to make the tokenizer convert a subword token into an independent token?

Recently, I have been using bert-base-multilingual-cased for my work in Bengali. When I feed a sentence like “আজকে হবে না” to BertTokenizer, I get the following output.

Sentence:  আজকে হবে না
Tokens:  ['আ', '##জ', '##কে', 'হবে', 'না']
To Int:  [938, 24383, 18243, 53761, 26109]

But when I feed in a sentence like “আজকে হবেনা”, where “না” is not separated by a space, the token becomes “##না”, with the corresponding index changed as well.

Sentence:  আজকে হবেনা
Tokens:  ['আ', '##জ', '##কে', 'হবে', '##না']
To Int:  [938, 24383, 18243, 53761, 20979]

Now, I was hoping there is a way to let the tokenizer know that whenever it finds something like ‘##না’, it should convert it to ‘না’, for all such cases.

It might be easier to replace না in the sentence with “ না” (a space followed by না) before you tokenize.
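For example, something along these lines (just a sketch; the regex and the split_trailing_na name are my own, and whether it is safe for every word ending in না depends on your data):

import re

# insert a space before a trailing "না" that is glued to the previous word,
# so the tokenizer sees it as a separate word
def split_trailing_na(sentence):
    return re.sub(r"(?<=\S)না(?=\s|$)", " না", sentence)

print(split_trailing_na("আজকে হবেনা"))   # আজকে হবে না
print(split_trailing_na("আজকে হবে না"))  # already spaced, left unchanged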

Is it just the ##না that is a problem, or do you want to get rid of all the ## continuation tokens?

Recently I realized it is not for all না.

And no, not for all ## continuation tokens, only for a few of them.

Can’t you just replace the tokens before converting them to IDs?

# set of all tokens that should be replaced
NEED_REPL = {"##না"}

def retokenize(tokens):
    return [t.replace("##", "", 1) if t in NEED_REPL else t for t in tokens]


tokens = ['আ', '##জ', '##কে', 'হবে', '##না']
replaced = retokenize(tokens)
print(replaced)
# ['আ', '##জ', '##কে', 'হবে', 'না']
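
You can then map the replaced tokens back to IDs with tokenizer.convert_tokens_to_ids(replaced), which should give you 26109 (the ID of the standalone না in your first example) instead of 20979 for ##না.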

Yes, but when I pass the whole dataset to the tokenizer, I have to do something like

encoding = self.tokenizer.encode_plus(
    reviews,
    add_special_tokens=True,
    max_length=self.max_len,
    return_token_type_ids=True,
    pad_to_max_length=True,
    return_attention_mask=True,
    return_tensors='pt',
)

where the encoding variable contains the input_ids and attention_mask for each sentence.

How could I override or work around the tokenizer function here?
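
One way might be to split the call yourself: tokenize the text, apply the replacement, and only then let the tokenizer build the model inputs from the fixed IDs. A minimal sketch, assuming a transformers version where prepare_for_model accepts padding/truncation (older versions use pad_to_max_length instead); the encode_review helper name is mine:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

NEED_REPL = {"##না"}

def retokenize(tokens):
    return [t.replace("##", "", 1) if t in NEED_REPL else t for t in tokens]

def encode_review(text, max_len):
    # tokenize first, fix the tokens, then let the tokenizer add special
    # tokens, padding and the attention mask from the corrected IDs
    tokens = retokenize(tokenizer.tokenize(text))
    ids = tokenizer.convert_tokens_to_ids(tokens)
    return tokenizer.prepare_for_model(
        ids,
        add_special_tokens=True,
        max_length=max_len,
        padding="max_length",
        truncation=True,
        return_token_type_ids=True,
        return_attention_mask=True,
        return_tensors="pt",
    )

encoding = encode_review("আজকে হবেনা", max_len=16)
print(encoding["input_ids"])

Alternatively, you could keep encode_plus untouched and just fix the raw review text (e.g. with the space-insertion idea above) before it reaches the tokenizer.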