Find which tokens are unknown in new data

noncomp · September 2, 2022, 7:02pm

I am fine-tuning on new data and using the GPT subword tokenizer. Certain substrings of my input get converted to the UNK token during tokenization. How can I find these substrings so I can add them to my tokenizer’s vocabulary?
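One possible way to locate these substrings, sketched under some assumptions: if the tokenizer is a Hugging Face "fast" tokenizer, it can return character offsets for each token via `return_offsets_mapping=True`, and any token whose id equals `tokenizer.unk_token_id` can be mapped back to the text it came from. The helper name `find_unknown_substrings` and the checkpoint name in the comments are illustrative, not taken from the original post. The helper works on plain lists so it can be tried without downloading a model.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def find_unknown_substrings(
    text: str,
    input_ids: Sequence[int],
    offsets: Sequence[Tuple[int, int]],
    unk_id: int,
) -> List[str]:
    """Return the substrings of `text` whose tokens were mapped to `unk_id`.

    `offsets` holds one (start, end) character span per token, in the same
    order as `input_ids`.
    """
    unknown = []
    for token_id, (start, end) in zip(input_ids, offsets):
        # Skip empty spans, which some tokenizers emit for special tokens.
        if token_id == unk_id and end > start:
            unknown.append(text[start:end])
    return unknown


# Toy example with hand-written ids and offsets (not real tokenizer output).
text = "price: 5€ ok"
input_ids = [10, 11, 12, 0, 13]  # 0 plays the role of the UNK id here
offsets = [(0, 5), (5, 6), (7, 8), (8, 9), (10, 12)]

unknown = find_unknown_substrings(text, input_ids, offsets, unk_id=0)
counts = Counter(unknown)  # tally candidates across one or more texts

# With a real fast tokenizer (assumed setup, requires `transformers`):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("openai-gpt")  # checkpoint is an assumption
# enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
# unknown = find_unknown_substrings(
#     text, enc["input_ids"], enc["offset_mapping"], tokenizer.unk_token_id
# )
```

The most frequent entries in `counts` are candidates for `tokenizer.add_tokens(...)`, followed by `model.resize_token_embeddings(len(tokenizer))` so the model gets embeddings for the new ids.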
