BPE has a byte fallback option that converts unknown (unk) characters to UTF-8 bytes. That’s all I know for now. After reading Wikipedia, Stack Overflow, and Google repos, I still don’t understand how everything works together.
It’s in Rust, so it might not be so convenient to read: tokenizers/tokenizers/src/decoders/byte_fallback.rs at main · huggingface/tokenizers · GitHub
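If the Rust is hard to follow, here is a rough Python sketch of the general idea. It is not a port of the linked file. It assumes byte tokens are spelled like `<0xE4>` and that runs of byte tokens that do not form valid UTF-8 are shown as one replacement character per byte. The function names and the toy vocabulary are made up for illustration.

```python
import re

# Assumed spelling of a byte token: "<0x" + two hex digits + ">".
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def encode_with_byte_fallback(pieces, vocab):
    """Keep pieces found in vocab; spell any other piece as UTF-8 byte tokens."""
    tokens = []
    for piece in pieces:
        if piece in vocab:
            tokens.append(piece)
        else:
            # Instead of emitting <unk>, emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in piece.encode("utf-8"))
    return tokens


def decode_byte_fallback(tokens):
    """Turn runs of byte tokens back into text; pass other tokens through."""
    out = []
    pending = bytearray()  # bytes collected from consecutive byte tokens

    def flush():
        if pending:
            try:
                out.append(pending.decode("utf-8"))
            except UnicodeDecodeError:
                # Assumption: one U+FFFD per byte that could not be decoded.
                out.append("\ufffd" * len(pending))
            pending.clear()

    for tok in tokens:
        match = BYTE_TOKEN.match(tok)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(tok)
    flush()
    return "".join(out)


# Toy vocabulary (hypothetical): "你" is missing, so it falls back to bytes.
toy_vocab = {"he", "llo", " "}
toy_tokens = encode_with_byte_fallback(["he", "llo", " ", "你"], toy_vocab)
print(toy_tokens)  # ['he', 'llo', ' ', '<0xE4>', '<0xBD>', '<0xA0>']
print(decode_byte_fallback(toy_tokens))  # hello 你
```

So the vocabulary only needs 256 extra byte tokens to represent any input, and the decoder's job is just to glue consecutive byte tokens back together before interpreting them as UTF-8.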