When I use a byte-level BPE tokenizer, is it possible to split an encoded sequence into sub-lists without losing information?
I think that when a sub-list is decoded back into a string, it should produce an error at the tail of the sequence, because the last ids may contain only part of the bytes of a token string.
I’m using the BART BPE tokenizer, which is the same as RoBERTa’s and is therefore a byte-level BPE.
The tokenizer splits the input based on its byte sequence.
However, when I have to split the ids into sequences of limited length, such as 128, a byte sequence could end up split across different documents. Decoding the ids into a string should then cause an error, because some of the bytes are missing at the tail of the ids sequence.
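The fixed-length split I have in mind can be sketched as follows. This is only a minimal illustration with placeholder ids: `chunk_ids` is a hypothetical helper, not part of any tokenizer library, and it ignores special tokens such as the leading 0 and trailing 2.

```python
from typing import List


def chunk_ids(ids: List[int], max_len: int = 128) -> List[List[int]]:
    """Split a flat list of token ids into consecutive chunks of at most max_len ids."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]


# Placeholder ids standing in for real tokenizer output (assumption).
ids = list(range(300))
chunks = chunk_ids(ids, max_len=128)
print([len(c) for c in chunks])  # [128, 128, 44]

# Joining the chunks again gives back the original id list.
rejoined = [i for chunk in chunks for i in chunk]
print(rejoined == ids)  # True
```

Concatenating the chunks recovers the original id list, so nothing is lost at the id level. My concern is only about what happens when each chunk is decoded on its own.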
For example, when I encode the following sequence:
“This string will be encoded as byte level” => [0, … , 2]
and then split the ids into two lists:
[0, … , 2] => [0, …], […, 2]
the ids at the boundary between the two lists may be missing information about the remaining bytes of the token string.
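To make the concern concrete, here is a toy sketch in plain Python. It does not use the BART or RoBERTa tokenizer. It assumes a hypothetical byte-level "tokenizer" with no merges at all, where each id is simply one UTF-8 byte, and a `decode` helper that replaces undecodable bytes instead of raising. Whether a real BPE vocabulary ever puts a token boundary inside a multi-byte character is exactly what I am unsure about.

```python
from typing import List

text = "naïve café"

# Assumption: toy byte-level encoding, one id per UTF-8 byte (no BPE merges).
ids: List[int] = list(text.encode("utf-8"))


def decode(byte_ids: List[int]) -> str:
    """Decode byte ids to text, replacing incomplete byte sequences with U+FFFD."""
    return bytes(byte_ids).decode("utf-8", errors="replace")


# "ï" is two bytes in UTF-8, so cutting after 3 bytes splits it in the middle.
cut = 3
left, right = ids[:cut], ids[cut:]

print(repr(decode(left)))    # ends with the replacement character U+FFFD
print(repr(decode(right)))   # starts with the replacement character U+FFFD

# Decoding the rejoined ids recovers the original text.
print(decode(left + right) == text)  # True
```

In this toy setting, each half decodes with a broken character at the boundary, while decoding the concatenated ids gives back the original string.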
So, to restate my question: when I use byte-level BPE, is it possible to split the encoded sequences into sub-lists without loss of information?