Find which tokens are unknown in new data

I am fine-tuning a model on new data with the GPT subword tokenizer. During tokenization, certain substrings of my input are converted to the UNK token. How can I find these substrings so that I can add them to my tokenizer’s vocabulary?
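
To make the question concrete, here is a rough sketch of the kind of check I have in mind, not a definitive solution. It assumes the tokenizer can be wrapped as a function that maps a string to a list of token ids, and that the UNK id is known. The names `find_unknown_words`, `toy_encode`, `TOY_VOCAB`, and `TOY_UNK_ID` are made up for illustration; the toy character-level encoder only stands in for a real tokenizer so the snippet runs on its own. Splitting on whitespace is also an assumption: a word encoded in isolation may not tokenize exactly as it does in context.

```python
from collections import Counter
from typing import Callable, Iterable, List


def find_unknown_words(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    unk_id: int,
) -> Counter:
    """Count whitespace-delimited words whose encoding contains unk_id."""
    unknown = Counter()
    for text in texts:
        for word in text.split():
            # Encode each word on its own so an UNK can be traced back to it.
            if unk_id in encode(word):
                unknown[word] += 1
    return unknown


# Hypothetical stand-in for a real tokenizer: a character-level vocabulary
# that only knows lowercase ASCII letters. Anything else maps to UNK.
TOY_VOCAB = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
TOY_UNK_ID = 0


def toy_encode(word: str) -> List[int]:
    return [TOY_VOCAB.get(ch, TOY_UNK_ID) for ch in word]


if __name__ == "__main__":
    corpus = ["hello w0rld", "covid19 test", "hello again"]
    # Print the offending words with their counts, most frequent first.
    for word, count in find_unknown_words(corpus, toy_encode, TOY_UNK_ID).most_common():
        print(word, count)
```

With a Hugging Face tokenizer, I assume the same function could be called with `lambda w: tokenizer.encode(w, add_special_tokens=False)` as `encode` and `tokenizer.unk_token_id` as `unk_id`. The words with the highest counts would then be the first candidates to add to the vocabulary.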
