How can I tell whether a word is OOV for my model?

For example, if I have the sentence "Kareem love Rawda but she can’t fly how to koberskobl"

and pass it to the tokenizer, I get the following tokens:
['ka', '##ree', '##m', 'love', 'raw', '##da', 'but', 'she', 'cannot', 'fly', 'how', 'to', 'do', 'kobe', '##rsk', '##ob', '##l']
Then I run the following code:

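oov_words = []  # words whose whole-word lookup falls back to the unknown token
for word in input_text.split():
    # convert_tokens_to_ids returns unk_token_id for any string that is not a vocabulary entry
    if tokenizer.convert_tokens_to_ids([word]) == [tokenizer.unk_token_id]:
        oov_words.append(word)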

# Print the OOV words
print("Out-of-vocabulary words:", oov_words)

This gives me: Out-of-vocabulary words: ['Kareem', 'Rawda', 'koberskobl']
Is this the best way to do this? I think I am misunderstanding part of the tokenization process.
What I am trying to find out is whether a word has a high probability of being OOV.
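
To make the distinction I am asking about concrete, here is a minimal sketch of greedy longest-match subword splitting in the WordPiece style. It is not the real tokenizer: the vocabulary `TOY_VOCAB` and the helpers `wordpiece` and `classify` are invented for illustration, and lowercasing assumes an uncased model. It separates three cases: a word that is a vocabulary entry on its own, a word that is split into several pieces, and a word that cannot be covered at all and becomes `[UNK]`.

```python
from typing import List, Set

# Toy vocabulary, invented for illustration only; a real model ships its own.
TOY_VOCAB: Set[str] = {
    "[UNK]", "love", "but", "she", "fly", "how", "to",
    "ka", "##ree", "##m", "raw", "##da",
    "kobe", "##rsk", "##ob", "##l",
}


def wordpiece(word: str, vocab: Set[str] = TOY_VOCAB, unk: str = "[UNK]") -> List[str]:
    """Split one word into subword pieces, greedy longest match first."""
    word = word.lower()  # assumption: the vocabulary is uncased
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits here, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def classify(word: str) -> str:
    """Label a word as 'whole-word', 'subword-split', or 'unknown'."""
    pieces = wordpiece(word)
    if pieces == ["[UNK]"]:
        return "unknown"
    return "whole-word" if len(pieces) == 1 else "subword-split"


for w in ["love", "Kareem", "koberskobl", "xyz"]:
    print(w, wordpiece(w), classify(w))
```

With this toy vocabulary, "Kareem" and "koberskobl" are split into pieces rather than mapped to `[UNK]`, which is different from what my whole-word lookup reports. With a real Hugging Face tokenizer, the analogous check would presumably be to call `tokenizer.tokenize(word)` on each word and look at how many pieces come back.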


Hi. Did you get any response to this question?
