How to truncate from the head in AutoTokenizer?

When we tokenize input like this, and the number of tokens in the text exceeds the set max_length, the tokenizer truncates from the tail end to limit the number of tokens to max_length.

tokenizer = AutoTokenizer.from_pretrained('MODEL_PATH')
inputs = tokenizer(text, max_length=max_length, truncation=True,
                   padding=True, return_tensors='pt')

Is there a way to change the behavior and truncate from the head end?

For example, if text = ['cat', 'dog', 'human'] and max_length=2, the last word 'human' is currently dropped. Would it be possible to add a truncation_end parameter to AutoTokenizer, so that truncation_end='head' drops 'cat' from the head of the sentence instead?
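
Edit: it looks like recent versions of transformers expose a truncation_side attribute on the tokenizer, which defaults to 'right'; assuming your installed version supports it, loading with truncation_side='left' should give head truncation directly:

tokenizer = AutoTokenizer.from_pretrained('MODEL_PATH', truncation_side='left')
inputs = tokenizer(text, max_length=max_length, truncation=True,
                   padding=True, return_tensors='pt')

Setting tokenizer.truncation_side = 'left' on an already-loaded tokenizer should work as well.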


I'm not sure if there is a built-in way to do this, but are you able to reverse the list before calling the tokenizer? I.e., change text to ['human', 'dog', 'cat'], then call the tokenizer so 'cat' is dropped from the tail, and restore the original order afterwards.
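
A minimal sketch of that idea, assuming text is a list of whole words and a tokenizer that has already been loaded. Note that the reversal has to happen at the word level, before tokenization, since reversing at the token level would scramble the subword pieces within each word, and that this budgets in words rather than tokens:

words = ['cat', 'dog', 'human']
max_words = 2

# Reverse, keep the first max_words words, then reverse back
# (equivalent to words[-max_words:], i.e. dropping from the head)
kept = list(reversed(list(reversed(words))[:max_words]))
print(kept)  # ['dog', 'human']

inputs = tokenizer(' '.join(kept), return_tensors='pt')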


I was wondering whether you would want to manipulate the result after you tokenize the text, as below:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encoded_sent = tokenizer.encode_plus(text='cat dog human', truncation=False, padding=False)

print('Before truncation:\n')
print(encoded_sent)
print(tokenizer.convert_ids_to_tokens(encoded_sent['input_ids']))

print('\nAfter truncation:\n')

# If maxlen is greater than the text size, pad with the [PAD] token
# If maxlen is less than the text size, drop tokens from the head
# Remember, we reserve two positions for [CLS] and [SEP] as we use BERT here
maxlen = 10

ids = encoded_sent['input_ids']
if len(ids) >= maxlen:
    # Keep [CLS], the last (maxlen - 2) real tokens, and [SEP]
    ids = [ids[0]] + ids[-(maxlen - 1):-1] + [tokenizer.sep_token_id]
else:
    ids = ids + [tokenizer.pad_token_id] * (maxlen - len(ids))
encoded_sent['input_ids'] = ids

# Single-sentence input, so token_type_ids are all zeros either way
tt_ids = encoded_sent['token_type_ids']
if len(tt_ids) >= maxlen:
    tt_ids = tt_ids[:maxlen]
else:
    tt_ids = tt_ids + [0] * (maxlen - len(tt_ids))
encoded_sent['token_type_ids'] = tt_ids

# Real tokens get mask 1, padding positions get mask 0
mask = encoded_sent['attention_mask']
if len(mask) >= maxlen:
    mask = mask[:maxlen]
else:
    mask = mask + [0] * (maxlen - len(mask))
encoded_sent['attention_mask'] = mask

print(encoded_sent)
print(tokenizer.convert_ids_to_tokens(encoded_sent['input_ids']))
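
As a quick sanity check on the truncation branch above: with maxlen = 4, the five tokens [CLS] cat dog human [SEP] should come back with 'cat' dropped from the head (assuming 'cat', 'dog', and 'human' each map to a single piece, as they do in the bert-base-uncased vocabulary):

ids = tokenizer.encode_plus(text='cat dog human')['input_ids']
maxlen = 4

# Keep [CLS], the last (maxlen - 2) real tokens, and [SEP]
truncated = [ids[0]] + ids[-(maxlen - 1):-1] + [tokenizer.sep_token_id]
print(tokenizer.convert_ids_to_tokens(truncated))
# ['[CLS]', 'dog', 'human', '[SEP]']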