Hi
Is there a way to collect the OOV (out-of-vocabulary) tokens for future fine-tuning?
I want to leave unknown words untokenized instead of having them replaced by <unk>,
but I couldn't figure out how.
For example, when the text "Hi there hello word"
is sent to the tokenizer, it outputs [Hi, <unk>, hello, word].
But I want the tokenizer to output [Hi, there, hello, word],
even if the word “there” is OOV.
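To make this concrete, here is a rough plain-Python sketch of the behavior I'm after. It does not use any HF API: `toy_tokenize`, `restore_oov` and the three-word vocab are hypothetical stand-ins for illustration, and it assumes a word-level tokenizer that emits exactly one token per whitespace-separated word (which would not hold for subword tokenizers).

```python
UNK = "<unk>"

def toy_tokenize(text, vocab):
    # Hypothetical stand-in for a word-level tokenizer:
    # every word missing from the vocab becomes <unk>.
    return [w if w in vocab else UNK for w in text.split()]

def restore_oov(text, tokens):
    # Put the original word back wherever the tokenizer produced <unk>,
    # and collect those words so they can be reused later (e.g. for fine-tuning).
    # Assumes tokens align one-to-one with the whitespace-split words.
    restored, oov = [], []
    for word, token in zip(text.split(), tokens):
        if token == UNK:
            restored.append(word)
            oov.append(word)
        else:
            restored.append(token)
    return restored, oov

vocab = {"Hi", "hello", "word"}  # toy vocab, assumed for this example
text = "Hi there hello word"

tokens = toy_tokenize(text, vocab)
restored, oov = restore_oov(text, tokens)

print(tokens)    # ['Hi', '<unk>', 'hello', 'word']
print(restored)  # ['Hi', 'there', 'hello', 'word']
print(oov)       # ['there']
```

So `restored` is the output I would like to get from the tokenizer, and `oov` is the list I would like to accumulate across my dataset.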
It seems that OpenNMT (https://forum.opennmt.net/t/leave-unknown-words-untranslated/2790) has this implemented, and I was wondering whether HF has it too, since I would like to stick to the HF framework.