Missing vocab in gpt2 model?

Hi there!

I’m new to this forum so I hope I’m posting this in the right place…

I am new to using GPT-2 and the HuggingFace library, but I'm trying to figure out how to use it for my purposes. I am currently trying to compare the probability GPT-2 assigns to predicted tokens against the actual tokens in an excerpt (using a random book for now). My problem is that sometimes a token doesn't exist in the vocab list, so no probability is generated. What could I do to overcome this? An example would be 'clocks', where I'm thinking maybe I'll just have to go with the lemmatized word, but also 'striking', which cannot be lemmatized further, yet still isn't in the vocab?

Many thanks!


I'm a beginner as well, but from what I've seen, you would still encode the word; since your tokenizer doesn't have it as a single entry, it will return a list of subword tokens that together correspond to that word. Then you would just average the probabilities of those word tokens.
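To sketch the idea: suppose the model has already given you a probability for each subword piece of a word like 'striking' (GPT-2's BPE might split it into something like 'strik' + 'ing'; the exact split and the numbers below are made up for illustration). The probability of the whole word is the product of the subtoken probabilities (chain rule), which is most conveniently computed as a sum of log-probabilities; dividing by the number of subtokens gives the average, which normalizes for word length when comparing words that split into different numbers of pieces.

```python
import math

# Hypothetical per-subtoken probabilities for one out-of-vocab word.
# In practice you would get these from the model's output distribution
# at each position of the word's subword tokens.
subtoken_probs = [0.20, 0.85]

# Probability of the whole word = product of subtoken probabilities,
# computed in log space for numerical stability.
word_log_prob = sum(math.log(p) for p in subtoken_probs)
word_prob = math.exp(word_log_prob)  # 0.20 * 0.85 = 0.17

# Average log-probability normalizes for the number of subtokens,
# so short and long words are compared on an even footing.
avg_log_prob = word_log_prob / len(subtoken_probs)

print(word_prob, avg_log_prob)
```

Whether you sum or average depends on what you're comparing: summing gives the true word probability, averaging is fairer across words of different subtoken counts.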