Hi,
Background
I want to tokenize several consecutive Chinese sentences using BertTokenizer.
Currently I just concatenate all the sentences and tokenize them together, since the tokenizer can use the context information to perform better than when tokenizing them separately.
Problem
But after tokenization I realized that I cannot get the position and length of each sentence. I do need them, since I want to get the embedding of each sentence.
What I tried
I tried many methods, for example adding special tokens between sentences or doing some matching after tokenization, but all of them failed: either the special tokens influence the results, or [UNK] tokens make the matching difficult.
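To make the matching issue concrete, here is a minimal sketch of what goes wrong. Note that `toy_tokenize` is a made-up stand-in for BertTokenizer (one token per character, with a tiny assumed vocabulary), not the real tokenizer, and `find_span` and the two sentences are just illustrative examples. It only shows why matching the original text against the tokens breaks once a character becomes [UNK].

```python
# Hypothetical stand-in for BertTokenizer: one token per character,
# and any character outside this tiny assumed vocabulary becomes [UNK].
VOCAB = set("我喜欢猫。他")

def toy_tokenize(text):
    return [ch if ch in VOCAB else "[UNK]" for ch in text]

def find_span(tokens, sentence, start=0):
    """Naive matching: search the token list for the sentence's characters.

    Returns (position, length) of the first match at or after `start`,
    or None if the sentence cannot be found.
    """
    chars = list(sentence)
    for i in range(start, len(tokens) - len(chars) + 1):
        if tokens[i:i + len(chars)] == chars:
            return i, len(chars)
    return None

sentences = ["我喜欢猫。", "他喜欢鼬。"]  # "鼬" is not in the toy vocabulary

# Concatenate everything and tokenize once, as described above.
tokens = toy_tokenize("".join(sentences))

first = find_span(tokens, sentences[0])
second = find_span(tokens, sentences[1], start=5)

print(tokens)
print(first)   # the first sentence is found at position 0 with length 5
print(second)  # None: the [UNK] token no longer matches the original character
```

In this toy case the first sentence is located, but the second one is lost because "鼬" was replaced by [UNK]. With the real tokenizer the same thing happens, so I cannot reliably recover where each sentence starts and how long it is.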
I wonder if there exists any method to solve this problem?
Thank you for your help!