What is a Text Pair in a tokenizer?

evanv · April 24, 2022, 9:50pm

Hey all,

I am relatively new to HuggingFace and deep NLP in general. In the documentation and in some example notebooks, I have noticed that tokenizers are used as follows:

tokenizer = SomeTokenizerClass()
encoding = tokenizer(text_to_tokenize, context_of_text)

Here text_to_tokenize and context_of_text are both str objects. In the documentation, this type of call is shown here.

What does this type of call to a tokenizer do, and why would it be different from encoding = tokenizer(text_to_tokenize + ' ' + context_of_text)?
Thank you so much for your help!

drt · September 3, 2022, 3:58pm

Have you figured this out? I am also wondering the same. It seems there isn’t enough explanation on the Internet.

davebulaval · July 18, 2023, 11:11pm

tl;dr It adds a [SEP] token (idx 102 in BERT's vocabulary) between the two sentences.

Here is your answer (see the long answer): "BERT for multiple sentences" in the nlp category of the PyTorch Forums.
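
To make the difference concrete, here is a rough sketch of what a BERT-style tokenizer does with a text pair. It is a toy, not the library's code: toy_tokenize, TOY_VOCAB, and encode are hypothetical names, the word ids are made up, and only the special-token ids (101 for [CLS], 102 for [SEP], 100 for [UNK]) follow bert-base-uncased. Other model families use different separators and may not produce segment ids at all.

```python
# Toy illustration only: a hypothetical whitespace tokenizer with a made-up vocabulary.
# Only the special-token ids below follow bert-base-uncased.
CLS_ID, SEP_ID, UNK_ID = 101, 102, 100

# Made-up word ids, purely for demonstration.
TOY_VOCAB = {"where": 2000, "is": 2001, "paris": 2002, "in": 2003, "france": 2004}


def toy_tokenize(text):
    """Split on whitespace and map each word to an id (unknown words -> UNK_ID)."""
    return [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]


def encode(text, text_pair=None):
    """Mimic BERT-style layout: [CLS] A [SEP] for one text, [CLS] A [SEP] B [SEP] for a pair."""
    input_ids = [CLS_ID] + toy_tokenize(text) + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # segment 0 covers [CLS], A, and the first [SEP]
    if text_pair is not None:
        second = toy_tokenize(text_pair) + [SEP_ID]
        input_ids += second
        token_type_ids += [1] * len(second)  # segment 1 covers B and the final [SEP]
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


question = "where is paris"
context = "paris is in france"

pair = encode(question, context)            # two arguments: a text pair
joined = encode(question + " " + context)   # one concatenated string

print(pair)
print(joined)
```

In the pair call, the two texts are separated by a [SEP] and the second one gets segment id 1, so the model can tell them apart. In the concatenated call, everything is one segment with a single trailing [SEP]. With the real transformers library, calling a BERT tokenizer with two strings should give you the same kind of layout in the input_ids and token_type_ids fields of the returned encoding.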
