Hello, everyone.
I’m working on an NLI task (such as MNLI or RTE), where two sentences are given and the model must predict whether the first sentence entails the second. I’d like to know how the Hugging Face tokenizer behaves when the first sentence alone exceeds the model’s maximum sequence length.
I’m using encode_plus() to tokenize my sentence pairs as follows:
inputs = tokenizer.encode_plus(example.text_a, example.text_b, add_special_tokens=True, max_length=max_length,)
I’d like to avoid the case where the second sentence is not encoded at all because the first sentence by itself already fills the model’s maximum input length. Is there an option for encode_plus() to truncate only the first sentence, so that the second one is always present in the processed data?
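For context, my understanding is that the default pair-truncation strategy (longest_first) drops tokens one at a time from whichever sequence is currently longer, while an only_first strategy would drop tokens from the first sequence only. A toy sketch of that difference in plain Python (the `truncate_pair` helper is hypothetical, not library code; the strategy names just mirror the `truncation` values as I understand them):

```python
def truncate_pair(tokens_a, tokens_b, max_tokens, strategy="longest_first"):
    """Toy illustration of pair truncation strategies (not the library code).

    longest_first: repeatedly drop a token from the end of whichever
    sequence is currently longer, so truncation is shared between both.
    only_first: drop tokens only from the first sequence, so the second
    sentence survives intact (assuming it fits on its own).
    """
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    while len(tokens_a) + len(tokens_b) > max_tokens:
        if strategy == "only_first":
            tokens_a.pop()
        elif len(tokens_a) >= len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
    return tokens_a, tokens_b


premise = [f"p{i}" for i in range(10)]     # long first sentence (10 tokens)
hypothesis = [f"h{i}" for i in range(8)]   # second sentence (8 tokens)

# longest_first trims both sides toward a balance:
print(truncate_pair(premise, hypothesis, 10))  # 5 premise + 5 hypothesis tokens

# only_first keeps the hypothesis whole:
print(truncate_pair(premise, hypothesis, 10, strategy="only_first"))
```

(In the real tokenizer the special tokens also count toward max_length, which this sketch ignores.)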