Missing vocab in gpt2 model?

Hi there!

I’m new to this forum, so I hope I’m posting this in the right place…

I am new to GPT-2 and the HuggingFace library, but I’m trying to figure out how to use it for my purposes. I’m currently trying to compare the probabilities GPT-2 assigns to predicted tokens against the actual tokens in an excerpt (using a random book for now). My problem is that sometimes the actual token doesn’t exist in the vocab list, so no probability is generated. What could I do to overcome this? One example is ‘clocks’, where I’m thinking I may just have to go with the lemmatized word, but another is ‘striking’, which can’t be lemmatized any further yet still isn’t in the vocab.

Many thanks!

Rain

I’m a beginner as well, but from what I’ve seen, you would still encode the word: since the tokenizer doesn’t have it as a single vocab entry, it will return a list of sub-word tokens that together make up the word. Then you would just take the average of those sub-word tokens’ probabilities.
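
A minimal sketch of what I mean, assuming the `transformers` library and the plain `gpt2` checkpoint (the context string and variable names are just placeholders, and how the word splits into sub-word tokens depends on the tokenizer):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "The clocks were"  # hypothetical preceding text from the excerpt
word = " striking"           # leading space matters: GPT-2's byte-level BPE treats " word" and "word" differently

context_ids = tokenizer.encode(context, return_tensors="pt")
word_ids = tokenizer.encode(word)  # may come back as several sub-word token ids

log_probs = []
input_ids = context_ids
with torch.no_grad():
    for token_id in word_ids:
        # distribution over the next token given everything seen so far
        logits = model(input_ids).logits[0, -1]
        token_log_prob = torch.log_softmax(logits, dim=-1)[token_id]
        log_probs.append(token_log_prob.item())
        # append this sub-word so later pieces are conditioned on earlier ones
        input_ids = torch.cat([input_ids, torch.tensor([[token_id]])], dim=1)

# average the sub-word (log-)probabilities to get one score for the whole word
avg_log_prob = sum(log_probs) / len(log_probs)
print(tokenizer.convert_ids_to_tokens(word_ids), avg_log_prob)
```

Averaging the log-probabilities is just one option; summing them gives the probability of the whole word, so it depends on what comparison you want to make.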