After tokenization, how can I get the position and length of each sub-sentence in a long text?


I want to tokenize several consecutive Chinese sentences using BertTokenizer.
Currently I concatenate all the sentences and tokenize the result as one string, since the tokenizer can use the surrounding context and performs better than when the sentences are tokenized separately.

But after tokenization I realized that I cannot recover the position and length of each sentence in the token sequence. I do need them, because I want to compute an embedding for each sentence.

What I tried
I tried several approaches, for example inserting special tokens between the sentences, or matching the tokens back to the original text after tokenization. All of them failed: the inserted special tokens change the tokenization results, and [UNK] tokens make matching against the original text unreliable.
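The matching failure can be illustrated with a toy stand-in for the tokenizer; the vocabulary, sentences, and helper `find_sentence_spans` below are all made up for illustration, but a real BERT tokenizer behaves analogously for out-of-vocabulary characters:

```python
# Toy vocabulary; any character outside it becomes [UNK], mimicking how
# BertTokenizer handles rare Chinese characters.
VOCAB = {"今", "天", "气", "很", "好", "。"}

def toy_tokenize(text):
    # Character-level tokenization with [UNK] for unknown characters.
    return [c if c in VOCAB else "[UNK]" for c in text]

sentences = ["今天天气很好。", "囧囧有神。"]  # second sentence has rare chars
joint = toy_tokenize("".join(sentences))
print(joint)

def find_sentence_spans(tokens, sentences):
    # Naive matching: walk the token list and compare each window
    # against the raw sentence text. It breaks as soon as an [UNK]
    # token hides the original character.
    spans, pos = [], 0
    for sent in sentences:
        if "".join(tokens[pos:pos + len(sent)]) != sent:
            return None  # matching failed
        spans.append((pos, len(sent)))
        pos += len(sent)
    return spans

print(find_sentence_spans(joint, sentences))  # None: [UNK] broke the match
```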

Is there any method to solve this problem?
Thank you for your help!