How does `byte_fallback` work and affect vocab size in BPE?

dinhanhx · December 5, 2023, 9:39am

BPE has byte fallback option to convert unk character to utf-8 bytes. That’s what I know for now. After reading wikipedia, stackoverflow, google repos, I still don’t understand how all things work together.

alvations · March 19, 2024, 9:06am

It’s in Rust so it might be a not so convenient to read. tokenizers/tokenizers/src/decoders/byte_fallback.rs at main · huggingface/tokenizers · GitHub

Topic		Replies	Views
Dynamic Programming for Byte-level BPE Research	0	874	May 1, 2022
Byte Level Tokenizer While Training 🤗Tokenizers	0	52	December 14, 2024
Change bpe-dropout value on the fly? 🤗Tokenizers	0	433	October 24, 2020
How to properly clean vocabulary from BBPE tokenizer 🤗Tokenizers	3	1044	October 1, 2022
Discussing the Pros and Cons of Using add_tokens vs. Byte Pair Encoding (BPE) for Adding New Tokens to an Existing RoBERTa Model 🤗Tokenizers	0	768	July 14, 2023

How does `byte_fallback` work and affect vocab size in BPE?

Related topics