How to truncate from the head in AutoTokenizer?

When we tokenize input like this, and the number of tokens in the text exceeds the set max_length, the tokenizer truncates from the tail end to limit the number of tokens to max_length.

tokenizer = AutoTokenizer.from_pretrained('MODEL_PATH')
inputs = tokenizer(text, max_length=max_length, truncation=True,
                   padding=True, return_tensors='pt')

Is there a way to change the behavior and truncate from the head end?

For example, if text = ['cat', 'dog', 'human'] and max_length=2, the last word 'human' is currently dropped. Would it be possible to add a truncation_end parameter to AutoTokenizer, so that truncation_end='head' drops 'cat' from the head of the sentence instead?
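
Edit: it looks like recent versions of transformers expose a truncation_side attribute on the tokenizer, which defaults to 'right'; assuming your installed version supports it, loading with truncation_side='left' should give head truncation directly:

tokenizer = AutoTokenizer.from_pretrained('MODEL_PATH', truncation_side='left')
inputs = tokenizer(text, max_length=max_length, truncation=True,
                   padding=True, return_tensors='pt')

Setting tokenizer.truncation_side = 'left' on an already-loaded tokenizer should work as well.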


I'm not sure if there is a built-in way to do this, but are you able to reverse the list before calling the tokenizer? I.e., change text to ['human', 'dog', 'cat'], then call the tokenizer so 'cat' is dropped from the tail, and restore the original order afterwards.
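
A minimal sketch of that idea, assuming text is a list of whole words and a tokenizer that has already been loaded. Note that the reversal has to happen at the word level, before tokenization, since reversing at the token level would scramble the subword pieces within each word, and that this budgets in words rather than tokens:

words = ['cat', 'dog', 'human']
max_words = 2

# Reverse, keep the first max_words words, then reverse back
# (equivalent to words[-max_words:], i.e. dropping from the head)
kept = list(reversed(list(reversed(words))[:max_words]))
print(kept)  # ['dog', 'human']

inputs = tokenizer(' '.join(kept), return_tensors='pt')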


I was wondering whether you would want to manipulate the result after you tokenize the text, as below:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encoded_sent = tokenizer.encode_plus(text='cat dog human', truncation=False, padding=False)

print('Before truncation:\n')
print(encoded_sent)
print(tokenizer.convert_ids_to_tokens(encoded_sent['input_ids']))

print('\nAfter truncation:\n')

# If maxlen is greater than the text size, pad with the [PAD] token
# If maxlen is less than the text size, drop tokens from the head
# Remember, we reserve two positions for [CLS] and [SEP] as we use BERT here
maxlen = 10

ids = encoded_sent['input_ids']
if len(ids) >= maxlen:
    # Keep [CLS], the last (maxlen - 2) real tokens, and [SEP]
    ids = [ids[0]] + ids[-(maxlen - 1):-1] + [tokenizer.sep_token_id]
else:
    ids = ids + [tokenizer.pad_token_id] * (maxlen - len(ids))
encoded_sent['input_ids'] = ids

# Single-sentence input, so token_type_ids are all zeros either way
tt_ids = encoded_sent['token_type_ids']
if len(tt_ids) >= maxlen:
    tt_ids = tt_ids[:maxlen]
else:
    tt_ids = tt_ids + [0] * (maxlen - len(tt_ids))
encoded_sent['token_type_ids'] = tt_ids

# Real tokens get mask 1, padding positions get mask 0
mask = encoded_sent['attention_mask']
if len(mask) >= maxlen:
    mask = mask[:maxlen]
else:
    mask = mask + [0] * (maxlen - len(mask))
encoded_sent['attention_mask'] = mask

print(encoded_sent)
print(tokenizer.convert_ids_to_tokens(encoded_sent['input_ids']))
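
As a quick sanity check on the truncation branch above: with maxlen = 4, the five tokens [CLS] cat dog human [SEP] should come back with 'cat' dropped from the head (assuming 'cat', 'dog', and 'human' each map to a single piece, as they do in the bert-base-uncased vocabulary):

ids = tokenizer.encode_plus(text='cat dog human')['input_ids']
maxlen = 4

# Keep [CLS], the last (maxlen - 2) real tokens, and [SEP]
truncated = [ids[0]] + ids[-(maxlen - 1):-1] + [tokenizer.sep_token_id]
print(tokenizer.convert_ids_to_tokens(truncated))
# ['[CLS]', 'dog', 'human', '[SEP]']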