Tokenizing two sentences with the tokenizer

Hello, everyone.

I’m working on an NLI task (such as MNLI or RTE) where two sentences are given and the model predicts whether the first sentence entails the second. I’d like to know how the Hugging Face tokenizer behaves when the first sentence alone exceeds the model’s maximum sequence length.

I’m using encode_plus() to tokenize my sentences as follows:
inputs = tokenizer.encode_plus(example.text_a, example.text_b, add_special_tokens=True, max_length=max_length)

I’d like to avoid the case where the second sentence is not encoded at all because the first sentence already fills the model’s maximum input length. Is there an option for encode_plus() to truncate only the first sentence, so that the second one always makes it into the processed data?


As explained in the docs, the truncation parameter accepts several strategies, including 'only_first', which truncates only the first sequence of a pair. Also note that encode_plus() is deprecated; the recommended approach is to call the tokenizer directly, on either a single sentence or a pair of sentences. TL;DR:

inputs = tokenizer(text_a, text_b, truncation='only_first', max_length=max_length)
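For intuition, here is a minimal pure-Python sketch (not the actual tokenizer internals) of what the 'only_first' strategy does: only the first token sequence is trimmed until the pair, plus the special tokens, fits within max_length. The function name and the assumption of 3 special tokens (BERT-style [CLS] a [SEP] b [SEP]) are illustrative, not part of the library API.

```python
def truncate_only_first(tokens_a, tokens_b, max_length, num_special_tokens=3):
    """Trim only tokens_a so that len(a) + len(b) + specials <= max_length."""
    # BERT-style pairs add [CLS], [SEP], [SEP] -> 3 special tokens.
    budget = max_length - num_special_tokens
    keep_a = budget - len(tokens_b)
    if keep_a < 0:
        # Even an empty first sequence would not fit; 'only_first' cannot help here.
        raise ValueError("second sequence alone exceeds the length budget")
    return tokens_a[:keep_a], tokens_b

a = ["tok"] * 100          # very long first sentence
b = ["hyp"] * 10           # short second sentence
a_trunc, b_trunc = truncate_only_first(a, b, max_length=32)
print(len(a_trunc), len(b_trunc))  # -> 19 10
```

The second sentence survives intact regardless of how long the first one is, which is exactly the behavior you want for NLI pairs.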