Is there a way to collect the out-of-vocabulary (OOV) tokens for future fine-tuning?
I want to leave unknown words as they are instead of having them replaced by
<unk>, but I couldn’t figure out how.
For example, the text
"Hi there hello word" is sent to the tokenizer, which outputs
[Hi, <unk>, hello, word]
But I want the tokenizer to output
[Hi, there, hello, word] even if the word “there” is OOV.
It seems OpenNMT (https://forum.opennmt.net/t/leave-unknown-words-untranslated/2790) has this implemented. Does HF have something similar? I would like to stick with the HF framework.
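
To make the desired behavior concrete, here is a rough sketch in plain Python. It is not the HF API: it assumes simple whitespace splitting, and `tokenize_keep_oov`, `vocab`, and `oov_counter` are hypothetical names made up for illustration. It only shows the two things I am after: OOV words pass through unchanged, and they are counted so they can be reviewed later.

```python
from collections import Counter


def tokenize_keep_oov(text, vocab, oov_counter):
    """Split on whitespace, keep every word as-is, and count the OOV ones."""
    tokens = []
    for word in text.split():
        if word not in vocab:
            # Record the unknown word instead of replacing it with <unk>.
            oov_counter[word] += 1
        tokens.append(word)
    return tokens


# Hypothetical toy vocabulary; a real tokenizer's vocabulary would be used instead.
vocab = {"Hi", "hello", "word"}
oov_counter = Counter()

tokens = tokenize_keep_oov("Hi there hello word", vocab, oov_counter)
print(tokens)       # ['Hi', 'there', 'hello', 'word']
print(oov_counter)  # Counter({'there': 1})
```

The counter accumulates across calls, so running it over a whole corpus would give a frequency list of OOV words to consider adding before fine-tuning.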