Combine multiple sentences together during tokenization

prajjwal1 · January 29, 2021, 2:52am

I want my output to be like [CLS] [SEP] text1 [SEP] text2 [SEP] text3 [SEP] eos token. By default, the tokenizer expects either a string or a pair of strings.
tokenizer(sentence1, sentence2)  # returns a single sequence of input_ids; I want this, but for three sentences
I want the pair-of-strings behavior for three sentences. I can pass a list of sentences, but that creates three separate lists of input_ids.
tokenizer([sentence1, sentence2, sentence3])  # returns three separate input_ids sequences

I want a single tensor representing the output I wrote above.
Is there a good way of doing this?

valhalla · January 29, 2021, 9:12am

I don’t think the tokenizer handles this case directly.

You could join the sentences with [SEP] yourself and then encode the result as one single text.

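from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-cased")
# Separate the sentences with the literal [SEP] token and encode them as one text
text = "sent1 [SEP] sent2 [SEP] sent3"
ids = tok(text, add_special_tokens=True).input_ids
tok.decode(ids)
# => '[CLS] sent1 [SEP] sent2 [SEP] sent3 [SEP]'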
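
For an arbitrary number of sentences, the same idea can be wrapped in a small helper. This is only a sketch: `join_with_sep` and `encode_joined` are hypothetical names, not part of the transformers API, and `tok` is assumed to be a Hugging Face tokenizer that exposes `sep_token` and returns an object with `input_ids`, as in the snippet above.

```python
from typing import List, Sequence


def join_with_sep(sentences: Sequence[str], sep_token: str = "[SEP]") -> str:
    """Join sentences into one string, separated by the separator token."""
    return f" {sep_token} ".join(s.strip() for s in sentences)


def encode_joined(tok, sentences: Sequence[str]) -> List[int]:
    """Encode all sentences as a single sequence.

    `tok` is assumed to be a Hugging Face tokenizer; its own sep_token is
    used so the separator matches the model (e.g. "[SEP]" for BERT).
    """
    text = join_with_sep(sentences, tok.sep_token)
    # The tokenizer itself adds the leading [CLS] and the trailing [SEP].
    return tok(text, add_special_tokens=True).input_ids
```

With the `tok` from above, `tok.decode(encode_joined(tok, ["sent1", "sent2", "sent3"]))` should give the same decoded string as the manual version.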

prajjwal1 · January 29, 2021, 2:54pm

Okay, thanks. I may have to add them manually then.

bilalghanem · February 4, 2022, 11:32pm

@prajjwal1 … I want to highlight a point here: using more than two [SEP] tokens with BERT is not a good scientific solution. BERT wasn’t trained for that.
