Missing vocab in gpt2 model?

Hi there!

I’m new to this forum so I hope I’m posting this in the right place…

I am new to using GPT-2 and the HuggingFace library, but I'm trying to figure out how to use it for my purposes. I am currently trying to compare the probability GPT-2 assigns to predicted tokens against the actual tokens in an excerpt (using a random book for now). My problem is that sometimes a token doesn't exist in the vocab list, so no probability is generated. What could I do to overcome this? An example would be 'clocks', where I'm thinking maybe I'll just have to go with the lemmatized word, but also 'striking', which cannot be lemmatized further, yet still isn't in the vocab?

Many thanks!


I'm a beginner as well, but from what I've seen, you would still encode the word; since your tokenizer doesn't have it as a single entry, it will return a list of subword tokens that together correspond to that word. Then you would just average the probabilities of those word tokens.
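To sketch the idea: suppose the model has already given you a probability for each subword piece of a word like 'striking' (GPT-2's BPE might split it into something like 'strik' + 'ing'; the exact split and the numbers below are made up for illustration). The probability of the whole word is the product of the subtoken probabilities (chain rule), which is most conveniently computed as a sum of log-probabilities; dividing by the number of subtokens gives the average, which normalizes for word length when comparing words that split into different numbers of pieces.

```python
import math

# Hypothetical per-subtoken probabilities for one out-of-vocab word.
# In practice you would get these from the model's output distribution
# at each position of the word's subword tokens.
subtoken_probs = [0.20, 0.85]

# Probability of the whole word = product of subtoken probabilities,
# computed in log space for numerical stability.
word_log_prob = sum(math.log(p) for p in subtoken_probs)
word_prob = math.exp(word_log_prob)  # 0.20 * 0.85 = 0.17

# Average log-probability normalizes for the number of subtokens,
# so short and long words are compared on an even footing.
avg_log_prob = word_log_prob / len(subtoken_probs)

print(word_prob, avg_log_prob)
```

Whether you sum or average depends on what you're comparing: summing gives the true word probability, averaging is fairer across words of different subtoken counts.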