For example, if I have the sentence "Kareem love Rawda but she can't fly how to koberskobl" and pass it to the tokenizer, it gives me the following tokens:
['ka', '##ree', '##m', 'love', 'raw', '##da', 'but', 'she', 'cannot', 'fly', 'how', 'to', 'do', 'kobe', '##rsk', '##ob', '##l']
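For context, this is roughly how I get those tokens (a minimal sketch; I am assuming a WordPiece checkpoint such as bert-base-uncased, so the exact splits may differ with another checkpoint):

from transformers import AutoTokenizer

# Assumption: bert-base-uncased; other checkpoints have other vocabularies
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

input_text = "Kareem love Rawda but she can't fly how to koberskobl"
tokens = tokenizer.tokenize(input_text)
print(tokens)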
Then I run the following code:
oov_words = []
for word in input_text.split():
    # Flag a word when looking it up as a single token maps to the unknown-token id
    if tokenizer.convert_tokens_to_ids([word]) == [tokenizer.unk_token_id]:
        oov_words.append(word)

# Print the OOV words
print("Out-of-vocabulary words:", oov_words)
It gives me: Out-of-vocabulary words: ['Kareem', 'Rawda', 'koberskobl']
Is this the best way to do this? I think I have some misunderstanding of the tokenization process.
What I am actually trying to do is figure out whether a word has a high probability of being OOV.
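To make that concrete, here is the kind of heuristic I have in mind (a rough sketch; subword_count is a helper name I made up, and I am assuming that a word already in the vocabulary tokenizes to a single piece, so the number of pieces can act as a proxy for how OOV-like a word is):

def subword_count(word):
    # Number of subword pieces the tokenizer needs for this word:
    # 1 piece means the word is in the vocabulary as-is,
    # more pieces means the tokenizer had to break it apart.
    return len(tokenizer.tokenize(word))

for word in input_text.split():
    if subword_count(word) > 1:
        # Candidate for being "effectively OOV"
        print(word, "->", tokenizer.tokenize(word))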