Hi, at training time I’m using the forward pass and `batch_decode` on the logits to get the decoded output:
```python
outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    decoder_input_ids=dec_input_ids,
    decoder_attention_mask=dec_attention_mask,
    labels=dec_input_ids,
)
loss, logits = outputs.loss, outputs.logits
decoded_output = tokenizer.batch_decode(
    torch.argmax(logits, dim=2).tolist(), skip_special_tokens=True
)
```
And `decoded_output` seems to match what I trained the model on:

```
bread dough ; side surface
```
However, I’ve noticed that `model.generate` produces nonsense:

```python
generated = model.generate(input_ids)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

```
table table table table table table table table table table table table table table table table table table
```
Note that it is the same `model` instance and the same `input_ids` in both cases, so this can’t be related to saving/loading issues, and I guess it also rules out encoding/tokenization issues for `input_ids`.
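For reference, a minimal sketch of how such tensors could be produced (the tokenizer class, texts, and options below are illustrative placeholders, not my exact pipeline):

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")

# placeholder source/target pair, just to illustrate the tensors involved
src_texts = ["some source sentence"]
tgt_texts = ["bread dough ; side surface"]

enc = tokenizer(src_texts, padding=True, truncation=True, return_tensors="pt")
dec = tokenizer(tgt_texts, padding=True, truncation=True, return_tensors="pt")

input_ids, attention_mask = enc.input_ids, enc.attention_mask
dec_input_ids, dec_attention_mask = dec.input_ids, dec.attention_mask
```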
Background: `model` is of class `T5ForConditionalGeneration` and initialized from `t5-small`.
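In other words, roughly:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
```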
What’s the problem here? I’ve used the `EncoderDecoderModel` in the very same way, and there `model.generate` works as expected.
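(For comparison, that other setup is schematically along these lines; the checkpoint names and tokenizer here are hypothetical stand-ins, not my actual configuration:)

```python
from transformers import BertTokenizerFast, EncoderDecoderModel

# hypothetical bert2bert example, only to illustrate the comparison
bert_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc_dec = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
# these usually need to be set explicitly before generate() works for bert2bert
enc_dec.config.decoder_start_token_id = bert_tok.cls_token_id
enc_dec.config.pad_token_id = bert_tok.pad_token_id

bert_inputs = bert_tok(["some source sentence"], return_tensors="pt")
generated = enc_dec.generate(bert_inputs.input_ids)
decoded = bert_tok.decode(generated[0], skip_special_tokens=True)
```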