How do I pad tokens to a fixed length for a single sentence?

>>> from transformers import BartTokenizerFast
>>> tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
>>> text = "How are you?"
>>> tokenizer(text, return_tensors="pt")
{'input_ids': tensor([[   0, 6179,   32,   47,  116,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
>>> tokenizer(text, padding=True, max_length=10, return_tensors="pt")
{'input_ids': tensor([[   0, 6179,   32,   47,  116,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

Why didn’t it show something like

{'input_ids': tensor([[   0, 6179,   32,   47,  116,    2, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

and how can I do it?

Hey @zuujhyt, you can activate the desired padding by passing padding="max_length" to the tokenizer, as follows:

tokenizer(text, return_tensors="pt", padding="max_length", max_length=10)
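For reference, a minimal sketch (assuming the same facebook/bart-large tokenizer and text variable as above) that also passes truncation=True, so inputs longer than max_length are cut down rather than left over-length; this should give exactly the padded output you expected, since BART's pad token id is 1:

>>> tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=10)
{'input_ids': tensor([[   0, 6179,   32,   47,  116,    2,    1,    1,    1,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}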

When padding=True, the tokenizer pads to the longest sequence in the batch (so a single sentence on its own gets no padding).
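To make that behaviour concrete, here is a quick sketch with a hypothetical two-sentence batch (the second sentence is made up for illustration): the shorter sentence is padded up to the length of the longest sequence in the batch, not up to max_length:

>>> batch = ["How are you?", "I am fine, thank you very much!"]
>>> enc = tokenizer(batch, padding=True, return_tensors="pt")
>>> # both rows now share the length of the longest sequence in the batch; the
>>> # shorter one is right-padded with pad token id 1 and gets zeros in its attention_mask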