Leaving unknown words untokenized like in OpenNMT

KhaiKit · October 18, 2023, 9:32am

Hi

Is there a way to collect the OOV (out-of-vocabulary) tokens for future fine-tuning?
I want to leave unknown words untokenized instead of having them replaced by <unk>, but I couldn’t figure out how.

For example, the text "Hi there hello word" is sent to the tokenizer, which outputs [Hi, <unk>, hello, word].
But I want the tokenizer to output [Hi, there, hello, word] even if the word "there" is OOV.

It seems OpenNMT (https://forum.opennmt.net/t/leave-unknown-words-untranslated/2790) has this implemented, and I was wondering whether HF has something similar, as I would like to stick to the HF framework.
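
To make the desired behaviour concrete, here is a minimal sketch of one possible workaround. It is not a built-in HF feature as far as this post knows. It tokenizes word by word and keeps the raw word whenever the tokenizer would emit the unknown token. The helper name `tokenize_keep_oov`, the `TOY_VOCAB` set, and `toy_tokenize` are hypothetical and exist only so the example runs offline.

```python
from typing import Callable, List, Tuple


def tokenize_keep_oov(
    text: str,
    tokenize: Callable[[str], List[str]],
    unk_token: str = "<unk>",
) -> Tuple[List[str], List[str]]:
    """Tokenize word by word; keep the raw word wherever unk_token would appear.

    Returns (tokens, oov_words), where oov_words collects every word whose
    tokenization contained unk_token, e.g. for later fine-tuning.
    """
    tokens: List[str] = []
    oov_words: List[str] = []
    for word in text.split():  # assumption: whitespace-separated words
        pieces = tokenize(word)
        if unk_token in pieces:
            # The tokenizer could not represent this word: keep it as-is.
            tokens.append(word)
            oov_words.append(word)
        else:
            tokens.extend(pieces)
    return tokens, oov_words


# Hypothetical stand-in for a real tokenizer (assumption, not an HF API).
TOY_VOCAB = {"Hi", "hello", "word"}


def toy_tokenize(word: str) -> List[str]:
    return [word] if word in TOY_VOCAB else ["<unk>"]


tokens, oov_words = tokenize_keep_oov("Hi there hello word", toy_tokenize)
print(tokens)     # ['Hi', 'there', 'hello', 'word']
print(oov_words)  # ['there']
```

With a Transformers tokenizer, the same helper could presumably be called as `tokenize_keep_oov(text, tokenizer.tokenize, tokenizer.unk_token)`. Two caveats: tokenizing each word in isolation may give different pieces than tokenizing the whole sentence (for instance with tokenizers that mark word boundaries), and the kept OOV words have no vocabulary ID, so the result is useful for collecting OOV words rather than as direct model input.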
