When tokenizing the input like this, if the number of tokens in the text exceeds the set max_length, the tokenizer truncates from the tail end to limit the sequence to max_length.
Is there a way to change this behavior and truncate from the head end instead?
For example, if text = ['cat', 'dog', 'human'] and max_length=2, the last word 'human' is currently dropped. Would it be possible to add a truncation_end parameter to AutoTokenizer, so that truncation_end='head' drops 'cat' from the head of the sentence?
I'm not sure if there is a built-in way to do this, but could you reverse the list before calling the tokenizer? I.e., change text to ['human', 'dog', 'cat'] and then call the tokenizer, so 'cat' is dropped from the tail.
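The reversal workaround above can be sketched in plain Python; this is an illustration of the head-truncation effect on a token list, not a call into the tokenizer itself:

```python
# Sketch of head (left-side) truncation on a plain list of tokens.
# The function name and max_length value are illustrative, not library API.
def truncate_head(tokens, max_length):
    """Keep only the last max_length tokens, dropping from the head."""
    if len(tokens) <= max_length:
        return tokens
    return tokens[-max_length:]

print(truncate_head(['cat', 'dog', 'human'], 2))  # ['dog', 'human']
```

Note that recent versions of transformers also expose a `truncation_side` attribute on tokenizers, which can be set to `'left'` (e.g. `tokenizer.truncation_side = 'left'`) to truncate from the head directly, without reversing the input.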
```python
# If maxlen is greater than the text size, pad with the '[PAD]' token.
# If maxlen is less than the text size, truncate the first tokens (the head).
# Remember: we reserve two tokens for [CLS] and [SEP], as we use BERT here.
maxlen = 10
```
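The comments above can be sketched as a small helper; this is a minimal illustration assuming plain string tokens, where `'[CLS]'`, `'[SEP]'`, and `'[PAD]'` follow BERT's special-token names and `pad_or_truncate_head` is a hypothetical function name:

```python
# Sketch of the pad-or-head-truncate logic described above.
def pad_or_truncate_head(tokens, maxlen=10):
    budget = maxlen - 2                 # reserve two slots for [CLS] and [SEP]
    if len(tokens) > budget:
        tokens = tokens[-budget:]       # truncate from the head
    seq = ['[CLS]'] + tokens + ['[SEP]']
    seq += ['[PAD]'] * (maxlen - len(seq))  # pad the tail up to maxlen
    return seq

print(pad_or_truncate_head(['cat', 'dog', 'human'], maxlen=4))
# ['[CLS]', 'dog', 'human', '[SEP]']
```

With maxlen=4 only two content tokens fit, so 'cat' is dropped from the head; with maxlen=10 the same input would instead be padded with '[PAD]' tokens at the tail.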