Dynamic Programming for Byte-level BPE

kevin998x · May 1, 2022, 2:58am

Could anyone explain the rationale behind equation (1) in "Neural Machine Translation with Byte-Level Subwords"?
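For context, here is a rough sketch of how I currently read that equation. This is my own assumption and not code from the paper: I take f(k) to be the maximum number of valid characters recoverable from the first k bytes, computed as f(k) = max over t in {1, 2, 3, 4} of f(k - t) + g(k - t + 1, k), where g is 1 if those t bytes form one valid UTF-8 character and 0 otherwise. The names `is_valid_char` and `recover` are hypothetical helpers I made up for illustration.

```python
def is_valid_char(chunk: bytes) -> bool:
    """Return True if chunk decodes to exactly one Unicode character."""
    try:
        return len(chunk.decode("utf-8")) == 1
    except UnicodeDecodeError:
        return False


def recover(data: bytes) -> str:
    """Recover as many valid characters as possible from a byte sequence.

    Assumed reading of the recurrence: f[k] is the best character count
    for data[:k], and the last t bytes (t = 1..4) are either one valid
    character (+1) or skipped as garbage (+0).
    """
    n = len(data)
    f = [0] * (n + 1)
    back = [(0, False)] * (n + 1)  # (chunk length, chunk was valid)
    for k in range(1, n + 1):
        best = -1
        for t in range(1, min(4, k) + 1):  # a UTF-8 character is 1 to 4 bytes
            valid = is_valid_char(data[k - t:k])
            score = f[k - t] + (1 if valid else 0)
            if score > best:
                best = score
                back[k] = (t, valid)
        f[k] = best

    # Walk the back-pointers from the end, keeping only the valid chunks.
    chars = []
    k = n
    while k > 0:
        t, valid = back[k]
        if valid:
            chars.append(data[k - t:k].decode("utf-8"))
        k -= t
    return "".join(reversed(chars))


# A byte sequence whose first byte was lost: the orphaned bytes are dropped.
broken = "你好".encode("utf-8")[1:]
print(recover(broken))
```

If this reading is right, the max over t = 1..4 is there simply because a UTF-8 character occupies at most four bytes, so the last character in any prefix must end in one of those four windows.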
Also, what exactly is meant by "The design of UTF-8 encoding ensures the uniqueness of this recovery process: for a character UTF-8 encoded with multiple bytes, its trailing bytes will not make a valid UTF-8 encoded character"?
How exactly are the hexadecimal digits in Figure 1 derived?
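To make the last two questions concrete, here is a small sketch of what I think is going on. It assumes the hex digits in Figure 1 are just the UTF-8 bytes of each character written in hexadecimal, which I have not confirmed against the paper. The helper names `utf8_hex` and `is_continuation` are mine.

```python
def utf8_hex(text: str) -> list[str]:
    """Encode text as UTF-8 and show each byte as two hex digits."""
    return [f"{b:02X}" for b in text.encode("utf-8")]


def is_continuation(byte: int) -> bool:
    """Trailing (continuation) bytes in UTF-8 always look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


print(utf8_hex("A"))   # one byte for ASCII
print(utf8_hex("é"))   # two bytes
print(utf8_hex("好"))  # three bytes

# Every byte after the first one in a multi-byte character is a
# continuation byte, and a continuation byte can never start a character.
encoded = "好".encode("utf-8")
print([is_continuation(b) for b in encoded])

try:
    encoded[1:].decode("utf-8")  # trailing bytes alone
except UnicodeDecodeError as err:
    print("trailing bytes are not a valid character:", err.reason)
```

If that is the right picture, the quoted sentence would mean that a leftover tail of a multi-byte character can never be mistaken for a character of its own, so there is only one way to split a byte sequence back into characters.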
