Most efficient way to move padding tokens to the right side of a tensor?

Hi, I was wondering what the most efficient way is to move the padding tokens in a given tensor to the rightmost positions? A more concrete example is illustrated below.

Suppose the padded tensor is the following one.

x = [
    [PAD, X, X, X, PAD],
    [PAD, PAD, X, X, X],
    [X, X, X, X, X]
]

I’d like the output to be:

x = [
    [X, X, X, PAD, PAD],
    [X, X, X, PAD, PAD],
    [X, X, X, X, X]
]

Thanks in advance for your help!
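
For reference, a minimal sketch of this reordering in plain PyTorch, using a stable sort of the pad mask (PAD stands in for the actual pad id):

import torch

PAD = 0  # stand-in pad id for illustration

x = torch.tensor([
    [PAD, 1, 2, 3, PAD],
    [PAD, PAD, 4, 5, 6],
    [7, 8, 9, 10, 11],
])

# Stable sort of the pad mask: non-pad positions (0) keep their relative
# order and move to the front, pad positions (1) are pushed to the right.
pad_mask = (x == PAD).to(torch.int8)
_, order = torch.sort(pad_mask, dim=1, stable=True)
right_padded = torch.gather(x, 1, order)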

A pretty old thread, but I could have used this answer for the last hour or so. I can't say it's the most efficient, but at least you don't have to retokenize. It does rely on having a single BOS token per sample, but you could probably find a way to work around that if necessary.

We can shift each row by a different amount - this is covered here. We'll use their roll_by_gather function for brevity.
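
In case that link ever goes stale, a gather-based helper along those lines looks roughly like this (a sketch matching the roll_by_gather(mat, dim, shifts) call used below; negative shifts roll a row to the left):

import torch

def roll_by_gather(mat, dim, shifts):
    # Roll each row (dim=1) or each column (dim=0) of a 2D tensor by its own offset.
    n_rows, n_cols = mat.shape
    if dim == 0:
        arange1 = torch.arange(n_rows, device=mat.device).view(n_rows, 1).repeat(1, n_cols)
        arange2 = (arange1 - shifts) % n_rows
        return torch.gather(mat, 0, arange2)
    elif dim == 1:
        arange1 = torch.arange(n_cols, device=mat.device).view(1, n_cols).repeat(n_rows, 1)
        arange2 = (arange1 - shifts) % n_cols
        return torch.gather(mat, 1, arange2)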

We just need to find the offset of the BOS token in each row and pass that, together with the samples, into the function above:

from transformers import AutoTokenizer

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token

samples = tokenizer([
    "This is a longer text to force padding",
    "Here is a short text"
], return_tensors="pt", padding=True, padding_side="left").input_ids.to(device)

print(tokenizer.batch_decode(samples, skip_special_tokens=False))
# ['<|begin_of_text|>This is a longer text to force padding',
#  '<|end_of_text|><|end_of_text|><|end_of_text|><|begin_of_text|>Here is a short text']

# Get the column index of the BOS token in each row - usually only one per sequence
bos_offsets = -(samples == tokenizer.bos_token_id).nonzero()[:, -1, None]

right_padded = roll_by_gather(samples, 1, bos_offsets)
print(tokenizer.batch_decode(right_padded, skip_special_tokens=False))
# ['<|begin_of_text|>This is a longer text to force padding',
#  '<|begin_of_text|>Here is a short text<|end_of_text|><|end_of_text|><|end_of_text|>']

Qwen ended my BOS trick's usefulness pretty quickly, so for anybody interested, I've found a bit of a hacky workaround. Instead of keying on the BOS token, we can take the first token that isn't a special token:

import torch

# Start from an all-ones mask and clear every position that holds a special token.
# Check each id for None first, since torch.eq doesn't accept None.
special_mask = torch.ones_like(samples)
if tokenizer.bos_token_id is not None:
    special_mask &= ~torch.eq(samples, tokenizer.bos_token_id)
if tokenizer.eos_token_id is not None:
    special_mask &= ~torch.eq(samples, tokenizer.eos_token_id)
if tokenizer.pad_token_id is not None:
    special_mask &= ~torch.eq(samples, tokenizer.pad_token_id)

When calculating the offsets there is a small stumbling block. For a tokenizer with no BOS token we want to shift each row left by the index of its first non-special token, but for a tokenizer with a BOS token the same shift would send the BOS to the very end: [PAD, PAD, BOS, X, X] shifted left by 3 becomes [X, X, PAD, PAD, BOS] rather than [BOS, X, X, PAD, PAD]. Adding the truth value of whether the tokenizer has a BOS token resolves this:

# Index of the first non-special token in each row; add 1 when the tokenizer
# has a BOS token so the BOS stays at the front instead of wrapping to the end
offsets = -special_mask.to(torch.int64).argmax(dim=1) + int(tokenizer.bos_token_id is not None)
indices = offsets[:, None]
right_padded = roll_by_gather(samples, 1, indices)
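
As a quick sanity check (reusing the tokenizer and samples from the first snippet), decoding right_padded should now show the padding at the end of each sequence:

print(tokenizer.batch_decode(right_padded, skip_special_tokens=False))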