What is a Text Pair in a tokenizer?

Hey all,

I am relatively new to HuggingFace and deep NLP in general. I have noticed, both in the documentation and in some example notebooks, that tokenizers are called as follows:

tokenizer = SomeTokenizerClass()
encoding = tokenizer(text_to_tokenize, context_of_text)

Where text_to_tokenize and context_of_text are both str objects. In the documentation, this type of call is shown here

What does this type of call to a tokenizer do, and why would it be different from encoding = tokenizer(text_to_tokenize + ' ' + context_of_text)?
Thank you so much for your help!


Have you figured this out? I am also wondering the same. It seems there isn’t enough explanation on the Internet.

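tl;dr For BERT-style tokenizers, it adds a [SEP] token (id 102 in the bert-base vocabularies) between the two sentences and marks the second sentence with token_type_ids of 1, so the model can tell the two segments apart. Concatenating the strings yourself gives one segment with no separator in the middle.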

Here is your answer (see the long answer): "BERT for multiple sentences" in the nlp category of the PyTorch Forums.
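To make the difference concrete, here is a toy sketch of the layout a BERT-style tokenizer produces for a text pair versus a single concatenated string. It is only an illustration: encode_bert_style is a hypothetical helper, it splits on whitespace instead of running real WordPiece, and it returns token strings rather than vocabulary ids. Other model families use different special tokens, so treat the exact layout as a BERT-specific assumption.

```python
# Toy illustration only: whitespace splitting stands in for real WordPiece,
# and encode_bert_style is a made-up helper, not part of any library.
CLS, SEP = "[CLS]", "[SEP]"


def encode_bert_style(text, text_pair=None):
    """Lay out tokens the way BERT-style models expect single or paired input."""
    # First segment: [CLS] tokens... [SEP], all marked as segment 0.
    tokens = [CLS] + text.lower().split() + [SEP]
    token_type_ids = [0] * len(tokens)
    if text_pair is not None:
        # Second segment: tokens... [SEP], all marked as segment 1.
        pair_tokens = text_pair.lower().split() + [SEP]
        tokens += pair_tokens
        token_type_ids += [1] * len(pair_tokens)
    return {"tokens": tokens, "token_type_ids": token_type_ids}


text_to_tokenize = "where is the cat"
context_of_text = "the cat is on the mat"

# Called as a pair: two segments, separated by [SEP].
paired = encode_bert_style(text_to_tokenize, context_of_text)
# Called on the concatenated string: one long segment.
joined = encode_bert_style(text_to_tokenize + " " + context_of_text)

print(paired["tokens"])
print(paired["token_type_ids"])
print(joined["tokens"])
print(joined["token_type_ids"])
```

With a real tokenizer you can check the same thing by comparing tokenizer(text_to_tokenize, context_of_text) against tokenizer(text_to_tokenize + ' ' + context_of_text) and looking at the input_ids and token_type_ids of each result.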
