How does `byte_fallback` work and affect vocab size in BPE?

BPE has a byte fallback option that converts unknown characters to UTF-8 bytes. That’s all I know for now. After reading Wikipedia, Stack Overflow, and the Google repos, I still don’t understand how everything works together.


It’s in Rust, so it might not be very convenient to read: tokenizers/tokenizers/src/decoders/byte_fallback.rs at main · huggingface/tokenizers · GitHub
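If the Rust is hard to follow, here is a rough Python sketch of the idea. It is a toy, not the library's actual code: the character-level vocab and the helper names (`build_vocab`, `encode_with_byte_fallback`, `decode_with_byte_fallback`) are made up for illustration. The assumption is the usual convention of reserving 256 tokens of the form `<0xNN>`, one per possible byte value, which is also why the vocab grows by 256 entries when byte fallback is on.

```python
# Toy sketch of byte fallback. Hypothetical helpers, not the tokenizers API.

# One token per possible byte value: "<0x00>" ... "<0xFF>" (256 entries).
BYTE_TOKENS = [f"<0x{b:02X}>" for b in range(256)]


def build_vocab(base_tokens):
    """Vocab = <unk> + 256 byte tokens + the ordinary learned tokens."""
    vocab = {"<unk>": 0}
    for tok in BYTE_TOKENS + list(base_tokens):
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode_with_byte_fallback(text, vocab):
    """Emit the character if it is known, otherwise its UTF-8 bytes as tokens."""
    out = []
    for ch in text:
        if ch in vocab:
            out.append(ch)
        else:
            # Instead of <unk>, fall back to one token per UTF-8 byte.
            out.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return out


def decode_with_byte_fallback(tokens):
    """Glue consecutive byte tokens back together and decode them as UTF-8."""
    pieces, buf = [], bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            try:
                buf.append(int(tok[3:5], 16))
                continue
            except ValueError:
                pass  # looks like a byte token but is not valid hex
        if buf:
            pieces.append(buf.decode("utf-8", errors="replace"))
            buf.clear()
        pieces.append(tok)
    if buf:
        pieces.append(buf.decode("utf-8", errors="replace"))
    return "".join(pieces)


vocab = build_vocab("helo ")  # toy vocab: only knows h, e, l, o and space
tokens = encode_with_byte_fallback("hello é", vocab)
print(len(vocab))   # 262 = 1 (<unk>) + 256 byte tokens + 5 characters
print(tokens)       # ['h', 'e', 'l', 'l', 'o', ' ', '<0xC3>', '<0xA9>']
print(decode_with_byte_fallback(tokens))  # hello é
```

So the trade-off is roughly this: 256 vocab slots are spent on byte tokens, and in exchange any input can be encoded without `<unk>`, at the cost of rare characters taking several tokens each.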