Most efficient way to move padding tokens to the right side of a tensor?

Hi, I was wondering what the most efficient way is to move the padding tokens in a given tensor to the rightmost positions? A more concrete example is illustrated below.

Suppose the padded tensor is the following one.

x = [
    [PAD, X, X, X, PAD],
    [PAD, PAD, X, X, X],
    [X, X, X, X, X]
]

I’d like the output to be:

x = [
    [X, X, X, PAD, PAD],
    [X, X, X, PAD, PAD],
    [X, X, X, X, X]
]

Thanks in advance for your help!
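
For reference, a minimal sketch of this reordering in plain PyTorch, using a stable sort of the pad mask (PAD stands in for the actual pad id):

import torch

PAD = 0  # stand-in pad id for illustration

x = torch.tensor([
    [PAD, 1, 2, 3, PAD],
    [PAD, PAD, 4, 5, 6],
    [7, 8, 9, 10, 11],
])

# Stable sort of the pad mask: non-pad positions (0) keep their relative
# order and move to the front, pad positions (1) are pushed to the right.
pad_mask = (x == PAD).to(torch.int8)
_, order = torch.sort(pad_mask, dim=1, stable=True)
right_padded = torch.gather(x, 1, order)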

A pretty old thread, but I could have used this answer for the last hour or so. I can't say it's the most efficient, but at least you don't have to retokenize. It does rely on having a single BOS token per sample, but you could probably find a way to work around that if necessary.

We can shift each row by a different amount - this is covered here. We'll use their roll_by_gather function for brevity.
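
In case that link ever goes stale, a gather-based helper along those lines looks roughly like this (a sketch matching the roll_by_gather(mat, dim, shifts) call used below; negative shifts roll a row to the left):

import torch

def roll_by_gather(mat, dim, shifts):
    # Roll each row (dim=1) or each column (dim=0) of a 2D tensor by its own offset.
    n_rows, n_cols = mat.shape
    if dim == 0:
        arange1 = torch.arange(n_rows, device=mat.device).view(n_rows, 1).repeat(1, n_cols)
        arange2 = (arange1 - shifts) % n_rows
        return torch.gather(mat, 0, arange2)
    elif dim == 1:
        arange1 = torch.arange(n_cols, device=mat.device).view(1, n_cols).repeat(n_rows, 1)
        arange2 = (arange1 - shifts) % n_cols
        return torch.gather(mat, 1, arange2)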

We just need to find the offset of the BOS token in each row and pass that, together with the samples, into the function above:

from transformers import AutoTokenizer

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token

samples = tokenizer([
    "This is a longer text to force padding",
    "Here is a short text"
], return_tensors="pt", padding=True, padding_side="left").input_ids.to(device)

print(tokenizer.batch_decode(samples, skip_special_tokens=False))
# ['<|begin_of_text|>This is a longer text to force padding',
#  '<|end_of_text|><|end_of_text|><|end_of_text|><|begin_of_text|>Here is a short text']

# Get the column index of the BOS token in each row - usually only one per sequence
bos_offsets = -(samples == tokenizer.bos_token_id).nonzero()[:, -1, None]

right_padded = roll_by_gather(samples, 1, bos_offsets)
print(tokenizer.batch_decode(right_padded, skip_special_tokens=False))
# ['<|begin_of_text|>This is a longer text to force padding',
#  '<|begin_of_text|>Here is a short text<|end_of_text|><|end_of_text|><|end_of_text|>']

Qwen ended my BOS trick's usefulness pretty quickly, so for anybody interested, I've found a bit of a hacky workaround. Instead of keying on the BOS token, we can take the first token that isn't a special token:

import torch

# Start from an all-ones mask and clear every position that holds a special token.
# Check each id for None first, since torch.eq doesn't accept None.
special_mask = torch.ones_like(samples)
if tokenizer.bos_token_id is not None:
    special_mask &= ~torch.eq(samples, tokenizer.bos_token_id)
if tokenizer.eos_token_id is not None:
    special_mask &= ~torch.eq(samples, tokenizer.eos_token_id)
if tokenizer.pad_token_id is not None:
    special_mask &= ~torch.eq(samples, tokenizer.pad_token_id)

When calculating the offsets there is a small stumbling block. For a tokenizer with no BOS token we want to shift each row left by the index of its first non-special token, but for a tokenizer with a BOS token the same shift would send the BOS to the very end: [PAD, PAD, BOS, X, X] shifted left by 3 becomes [X, X, PAD, PAD, BOS] rather than [BOS, X, X, PAD, PAD]. Adding the truth value of whether the tokenizer has a BOS token resolves this:

# Index of the first non-special token in each row; add 1 when the tokenizer
# has a BOS token so the BOS stays at the front instead of wrapping to the end
offsets = -special_mask.to(torch.int64).argmax(dim=1) + int(tokenizer.bos_token_id is not None)
indices = offsets[:, None]
right_padded = roll_by_gather(samples, 1, indices)
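
As a quick sanity check (reusing the tokenizer and samples from the first snippet), decoding right_padded should now show the padding at the end of each sequence:

print(tokenizer.batch_decode(right_padded, skip_special_tokens=False))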