Tokenizing two sentences with the tokenizer

Hello, everyone.

I’m working on an NLI task (MNLI, RTE, etc.), where two sentences are given and the goal is to predict whether the first sentence entails the second. I’d like to know how the Hugging Face tokenizer behaves when the first sentence alone exceeds the model’s maximum sequence length.

I’m using encode_plus() to tokenize my sentences as follows:
inputs = tokenizer.encode_plus(example.text_a, example.text_b, add_special_tokens=True, max_length=max_length)
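
In case it helps, here is a minimal, self-contained version of this setup (the model name and example texts below are placeholders, not my actual data):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

text_a = "a very long premise " * 200  # long enough to overflow max_length
text_b = "the hypothesis"
max_length = 128

inputs = tokenizer.encode_plus(text_a, text_b, add_special_tokens=True, max_length=max_length, truncation=True)

# For BERT-style tokenizers, token_type_ids is 0 for first-sentence tokens and
# 1 for second-sentence tokens, so this checks whether any of the second
# sentence survived truncation:
print(any(t == 1 for t in inputs["token_type_ids"]))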

I’d like to avoid the case where the second sentence is not encoded at all because the first sentence already exceeds the model’s maximum input length. Is there an option for encode_plus() to truncate only the first sentence, so that the second one always appears in the processed data?

Hi,

As explained in the docs, you can pass several strategies via the truncation parameter: 'only_first' (remove tokens from the first sequence only), 'only_second', and 'longest_first' (the default when truncation=True, which removes tokens from the longer sequence at each step). Also note that encode_plus is deprecated; it is recommended to call the tokenizer directly instead, whether on a single sentence or a pair of sentences. TL;DR:

inputs = tokenizer(text_a, text_b, truncation='only_first', max_length=max_length)
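
For completeness, a runnable sketch (the model name and texts are just assumptions for illustration). With truncation='only_first', tokens are removed from the first sequence only, so the second sentence is kept intact as long as it fits within max_length on its own:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

text_a = "a very long premise " * 200  # overflows max_length on its own
text_b = "the hypothesis to keep"
max_length = 32

inputs = tokenizer(text_a, text_b, truncation='only_first', max_length=max_length)

# Decoding shows the truncated premise followed by the full hypothesis:
print(tokenizer.decode(inputs["input_ids"]))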
