Leaving unknown words untokenized, as in OpenNMT


Is there a way to collect the OOV (out-of-vocabulary) tokens for future fine-tuning?
I want to leave unknown words as they are instead of having them replaced by <unk>, but couldn't figure out how.

For example, the text "Hi there hello word" is sent to the tokenizer, which outputs [Hi, <unk>, hello, word].
But I want the tokenizer to output [Hi, there, hello, word], even if the word "there" is OOV.
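
To make the desired behaviour concrete, here is a minimal pure-Python sketch of it. It assumes a word-level vocabulary and whitespace splitting. The names `tokenize_keep_oov`, `vocab`, and `oov_counts` are hypothetical and not part of any HF API. It only illustrates keeping the surface form of each word alongside the <unk>-mapped tokens while counting the OOV words.

```python
from collections import Counter

UNK = "<unk>"  # assumed unknown-token string, matching the example above


def tokenize_keep_oov(text, vocab, oov_counts=None):
    """Split on whitespace and return (model_tokens, surface_tokens).

    model_tokens has OOV words mapped to UNK (what the model would see);
    surface_tokens keeps every word exactly as written.
    If oov_counts is given, OOV words are tallied in it.
    """
    words = text.split()
    model_tokens = [w if w in vocab else UNK for w in words]
    if oov_counts is not None:
        oov_counts.update(w for w in words if w not in vocab)
    return model_tokens, words


# Toy vocabulary (hypothetical): "there" is deliberately missing.
vocab = {"Hi", "hello", "word"}
oov_counts = Counter()

model_tokens, surface_tokens = tokenize_keep_oov(
    "Hi there hello word", vocab, oov_counts
)
print(model_tokens)              # ['Hi', '<unk>', 'hello', 'word']
print(surface_tokens)            # ['Hi', 'there', 'hello', 'word']
print(oov_counts.most_common())  # [('there', 1)]
```

With an HF tokenizer, the vocabulary could presumably be taken from `tokenizer.get_vocab()`, and the collected words could later be registered with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` before fine-tuning. Whether this fits depends on the tokenizer: subword tokenizers split unknown words into pieces and rarely emit <unk> at all.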

It seems OpenNMT (https://forum.opennmt.net/t/leave-unknown-words-untranslated/2790) has this implemented, and I was wondering whether HF does too, since I would like to stay within the HF framework.