Is it okay to split an ID sequence that was encoded with byte-level BPE?

jisng-prk · July 7, 2021, 8:59am

TL;DR

When I use a byte-level BPE tokenizer, is it possible to split the encoded sequence into sublists without losing information?
I think that when a sublist is decoded back into a string, it should produce an error at the tail of the sequence, because the last IDs contain only part of the bytes of a string token.

I’m using the BART BPE tokenizer, which is the same as RoBERTa’s and is therefore a byte-level BPE.

The tokenizer splits the input based on its byte sequence.

But when I have to split the sequence to a limited length such as 128, the bytes of one token string might end up split across different documents. Decoding those bytes into a string should then fail, because part of the byte sequence is missing at the tail of the ID sequence.
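To make the splitting step concrete, here is a minimal sketch of the chunking I mean. The helper name `chunk_ids` and the placeholder IDs are hypothetical (they are not real tokenizer output), and handling of special tokens such as 0 and 2 is left out.

```python
def chunk_ids(ids: list[int], max_len: int = 128) -> list[list[int]]:
    """Split a flat list of token IDs into consecutive chunks of at most max_len."""
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]


# Placeholder IDs standing in for an encoded document (not real BART output).
ids = list(range(300))

chunks = chunk_ids(ids, max_len=128)

# Concatenating the chunks in order gives back the original ID list,
# so at the ID level the split itself drops nothing.
rejoined = [tok for chunk in chunks for tok in chunk]
```

The open question is what happens when each chunk is decoded on its own, not whether the IDs themselves survive the split.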

For example, when I encode the following sequence:

“This string will be encoded as byte level” => [0, … , 2]

and I split the IDs into two lists,

[0, … , 2] => [0, …], […, 2]

the IDs at the boundary between the two lists can be missing information about the remaining bytes of the token string.
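The boundary problem can be sketched with a toy example. This is not BART's vocabulary or merge table: `toy_encode` and `toy_decode` are hypothetical helpers in which every "token" is just a fixed-size run of raw UTF-8 bytes, and decoding is assumed to replace incomplete byte sequences with U+FFFD instead of raising.

```python
def toy_encode(text: str, width: int = 2) -> list[bytes]:
    """Toy byte-level 'tokenizer': cut the UTF-8 bytes into runs of `width` bytes."""
    data = text.encode("utf-8")
    return [data[i:i + width] for i in range(0, len(data), width)]


def toy_decode(tokens: list[bytes]) -> str:
    """Join the token bytes and decode; incomplete characters become U+FFFD."""
    return b"".join(tokens).decode("utf-8", errors="replace")


text = "héllo"                 # "é" takes two bytes in UTF-8
tokens = toy_encode(text)      # the two bytes of "é" land in different tokens

# Split the token list right where "é" is cut in half.
left, right = tokens[:1], tokens[1:]

left_text = toy_decode(left)            # ends with a replacement character
right_text = toy_decode(right)          # starts with a replacement character
joined_text = toy_decode(left + right)  # decoding the rejoined list restores the text
```

In this toy setup, each half decodes with a replacement character at the cut, while the rejoined list decodes to the original string. For pure ASCII text like my example sentence, every character is a single byte, so a cut can never land inside a character. Whether the real BART tokenizer behaves the same way at chunk boundaries is exactly what I am asking.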

So, when I use byte-level BPE, is it possible to split the encoded sequence into sublists without losing information?
