After tokenization, how to get sub-sentence length in a long sentence?

Alethia · October 2, 2022, 7:21pm

Hi,

Background
I want to tokenize several consecutive Chinese sentences using BertTokenizer.
Currently I just concatenate all the sentences and tokenize them in one pass, since the tokenizer can use the context to perform better than when tokenizing them separately.

Problem
But after tokenization I realized that I cannot recover the position and length of each sentence in the token sequence. I need both, because I want to get the embedding of each sentence.

What I tried
I tried several approaches, such as adding special tokens between sentences or matching the tokens back to the text after tokenization. All of them failed: the special tokens change the results, and [UNK] tokens make the matching unreliable.
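One direction that might sidestep both issues is to work from character offsets instead of token text. This is only a minimal sketch: it assumes per-token (start, end) character offsets are available, for example from a fast tokenizer such as BertTokenizerFast called with return_offsets_mapping=True, and that special tokens are reported as (0, 0). The function name sentence_token_spans and the hand-built offsets below are hypothetical, for illustration only.

```python
from typing import List, Tuple


def sentence_token_spans(
    sentences: List[str],
    offsets: List[Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Return (first_token_index, token_count) for each sentence.

    `sentences` are concatenated with no separator before tokenization.
    `offsets` holds one (char_start, char_end) pair per token, measured
    against the concatenated string. Zero-width pairs such as (0, 0) are
    assumed to be special tokens ([CLS], [SEP]) and are skipped.
    """
    # Character range covered by each sentence in the concatenated text.
    bounds = []
    pos = 0
    for sentence in sentences:
        bounds.append((pos, pos + len(sentence)))
        pos += len(sentence)

    spans = []
    for lo, hi in bounds:
        # A token belongs to the sentence in which it starts.
        idx = [
            i for i, (start, end) in enumerate(offsets)
            if start != end and lo <= start < hi
        ]
        spans.append((idx[0], len(idx)) if idx else (-1, 0))
    return spans


# Hypothetical offsets: [CLS], one token per character, then [SEP].
sentences = ["今天天气好。", "我们去公园。"]
offsets = [(0, 0)] + [(i, i + 1) for i in range(12)] + [(0, 0)]
spans = sentence_token_spans(sentences, offsets)
```

Because the lookup uses offsets rather than token strings, an [UNK] token should not break it as long as the tokenizer still reports an offset for it, and no extra special tokens have to be inserted between sentences.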

Is there any method to solve this problem?
Thank you for your help!
