T5 decoder keeps predicting tokens even after hitting the end-of-sequence token, i.e. </s>

I am using a T5 model for a seq2seq task. I made sure to replace padding tokens with -100 in the labels. Below is my tokenization function:

max_source_length = 90
max_target_length = 90

def tokenization_function(batch):
    # Tokenize inputs and targets, both padded/truncated to a fixed length
    model_inputs = tokenizer(batch['user_request'], padding="max_length", max_length=max_source_length, truncation=True, return_tensors="pt")
    labels = tokenizer(batch['command'], padding="max_length", max_length=max_target_length, truncation=True, return_tensors="pt")
    model_inputs["decoder_attention_mask"] = labels['attention_mask']
    labels = labels["input_ids"]
    # Replace padding token ids with -100 so they are ignored by the loss
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

tokenized_dataset = dataset.map(tokenization_function, batched=True, batch_size=1024)
tokenized_dataset

After training, I run inference with the script below:

with torch.no_grad():
    for step, batch in enumerate(eval_dataloader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        pred = torch.argmax(outputs['logits'], dim=-1)
        for i, p in enumerate(pred):
            # Truncate the prediction at the first </s> token (id 1), if present
            if torch.where(p == 1)[0].size(0) != 0:
                idx = torch.where(p == 1)[0][0]
                seq = p[:idx].reshape(1, -1)
            else:
                seq = p.reshape(1, -1)
            pred_text = tokenizer.batch_decode(seq)
            print(batch['command'][i])
            print(pred_text[0])
            print()
        break

For instance, pred[0] has the following value after applying argmax:

tensor([ 1041,   834,  6583,   283, 26479,  3876,   834,  6583,     3,  4254,
        25528,    16, 10646,   834,  5540,  5839,   804,   834,  5540, 15959,
         3856,    15,    44, 15959,  3138,     1,     1,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041],
       device='cuda:0')

Shouldn’t the autoregression of the decoder stop after predicting id 1? With my limited knowledge, I believe 1 corresponds to the end-of-sequence token </s>. Instead, why am I getting 1041 until the max length of 90 is reached? Is this the desired output? What should I do to stop my prediction right after the </s> token is predicted?

I am a beginner at working with language models, so please feel free to point out any other issues in the snippets.

cc: @nielsr

Hi,

At inference time, it’s recommended to use the generate() method, which takes care of autoregressive generation.

See my notebooks regarding fine-tuning T5 for a seq2seq task: Transformers-Tutorials/T5 at master · NielsRogge/Transformers-Tutorials · GitHub. They include an inference section.
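
For reference, here is a minimal sketch of that inference flow (not the exact code from the notebooks), assuming the model, tokenizer, eval_dataloader, device and max_target_length defined earlier in this thread:

model.eval()
with torch.no_grad():
    for batch in eval_dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        # generate() runs autoregressive decoding and stops each sequence at </s> (eos_token_id=1)
        generated_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_length=max_target_length,
        )
        # skip_special_tokens=True drops <pad> and </s> from the decoded strings
        predictions = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        print(predictions[0])
        break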


Thank you, now I am no longer getting the arbitrary token value after the end of sequence, i.e. I am not getting 1041. But I am still getting 0s, which correspond to padding tokens.

I generated generated_ids with:

generated_ids = model.generate(input_ids, do_sample=False, max_length=max_target_length)

One sample from generated_ids:

tensor([    0,  1041,   834,  6583,   283, 26479,  3876,   834,  6583,     3,
         3463,     4,   382,    16, 10646,   834,  5540,  5839,   804,   834,
         5540, 15959,  3138,  1499,     1,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0], device='cuda:0')

After decoding with tokenizer.batch_decode(generated_ids) I get:

<pad> action_para MOVE component_para TEXT intial_state none final_state swapped text</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>

Ideally these values should go away, right? Am I missing something?

PS: Apologies, it works when I add skip_special_tokens=True to the decoding step, i.e. tokenizer.batch_decode(generated_ids, skip_special_tokens=True), as in your example notebooks. Thank you.

No, that seems correct: the model has generated the end-of-sequence token (with ID 1), after which generation stops. One usually also provides skip_special_tokens=True to the batch_decode method in order to skip special tokens (like end-of-sequence or padding tokens):

generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.