T5 forward pass versus generate: the latter outputs nonsense

Hi, during training I’m using the forward pass and batch_decode on the argmax of the logits to get the decoded output:

    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=dec_input_ids,
        decoder_attention_mask=dec_attention_mask,
        labels=dec_input_ids,
    )

    loss, logits = outputs.loss, outputs.logits
    decoded_output = tokenizer.batch_decode(
        torch.argmax(logits, dim=2).tolist(), skip_special_tokens=True
    )

And decoded_output seems to match what I trained the model on:

bread dough ; side surface

However, I’ve noticed that using model.generate produces nonsense:

    generated = model.generate(input_ids)
    tokenizer.decode(generated[0], skip_special_tokens=True)

table table table table table table table table table table table table table table table table table table

Note that this is the same model instance and the same input_ids, so it can’t be a saving/loading issue, and I suppose it also rules out encoding/tokenization issues for input_ids.

Background: model is of class T5ForConditionalGeneration and initialized with t5-small.

What’s the problem here? I’ve used the EncoderDecoderModel in the very same way, and there, model.generate works as expected.

It may help to know the exact structure of input_ids (e.g., its dimensions) and what the context is for each batch element.

input_ids is of shape (batch_size, max_len), i.e., torch.Size([16, 100]).
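
For illustration, a batch of that shape would typically come from a tokenizer call like the one below (just a sketch; source_texts is a hypothetical list of 16 source strings, not my actual dataset code):

    # Sketch only: pad/truncate the 16 sources to a fixed max_len of 100.
    enc = tokenizer(
        source_texts,
        padding="max_length",
        truncation=True,
        max_length=100,
        return_tensors="pt",
    )
    input_ids = enc["input_ids"]            # torch.Size([16, 100])
    attention_mask = enc["attention_mask"]  # torch.Size([16, 100])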

I’m not sure what you mean by context though.

This is the training routine including the output comparison:

    for d in data_loader:
        input_ids = d['enc_input_ids'].to(device)
        attention_mask = d['enc_attention_mask'].to(device)
        dec_input_ids = d['dec_input_ids'].to(device)
        dec_attention_mask = d['dec_attention_mask'].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=dec_input_ids,
            decoder_attention_mask=dec_attention_mask,
            labels=dec_input_ids,
        )

        loss, logits = outputs.loss, outputs.logits

        decoded_output = tokenizer.batch_decode(
            torch.argmax(logits, dim=2).tolist(), skip_special_tokens=True
        )

        # Compare outputs...
        for i, output in enumerate(decoded_output):
            if random.random() < 0.05:  # only show a few random examples
                generated = model.generate(input_ids, decoder_start_token_id=model.config.pad_token_id)
                print("Forward pass output: ", output)
                print("Generate output: ", tokenizer.decode(generated[i], skip_special_tokens=True))

        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

In general, context refers to the input that a model makes a prediction from (e.g., the beginning of a sentence in a next-word prediction task). So I should ask: what sort of task are you training on? Is it just span corruption (i.e., the original T5 pre-training objective)?

Regarding your code: one does not expect the generate output to be identical to the training output. One reason is that the generate API is meant to be used in inference mode (i.e., after training is finished). This means that layers that behave differently in training versus inference (e.g., dropout) will change the outputs, so the model will not produce the same text.
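
As a rough sketch (not your exact code), you can take the dropout difference out of the picture by switching the model into evaluation mode before calling generate, and back afterwards:

    # Disable dropout and gradient tracking for the check, then restore training mode.
    model.eval()
    with torch.no_grad():
        generated = model.generate(input_ids, attention_mask=attention_mask)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))
    model.train()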

So I suggest fine-tuning the model on your task first and then calling the generate API. It could be that you are checking the behaviour early on, before the model has had time to learn anything.

The only reason I put generate there is for debugging: I noticed that the model produced nonsense in inference mode while achieving solid accuracies on the training and validation sets. By using the very same model instance and input_ids, I wanted to rule out saving/loading and encoding/tokenization issues. I also checked the outputs at early and late stages of training; generate keeps producing nonsense.

Note that I was able to use the same code successfully with the EncoderDecoderModel; I only had to switch the initialization of the tokenizer and the model.

I’m using the pre-trained T5 model as the initialization for a general sequence-to-sequence task. It’s a very simple extraction task, still a toy problem: the sources are always of the form "put x into y", with "x ; y" being the targets. So the model should be able to reach near-perfect accuracy.
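
To make the format concrete, an illustrative pair (made up here, not taken from my data) looks like:

    # Hypothetical example following the "put x into y" -> "x ; y" pattern:
    source = "put bread dough into side surface"
    target = "bread dough ; side surface"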

OK, I do not know much about the EncoderDecoderModel, so I cannot comment on that. One thing I would like to understand is your setting for decoder_start_token_id. Why did you go with the pad token as opposed to a special token (e.g., <s>, a common choice in translation)? Is this how T5 is trained to begin with?

The Hugging Face documentation of T5 (see the Training section) says:

The PAD token is hereby used as the start-sequence token.

Since decoder_start_token_id is an optional parameter for T5, I also tried it without setting it at all, but with the same result (I suspect the pad token is the default).
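
One way to check the default (just a sketch):

    # Minimal sketch: t5-small's configured decoder start token is the pad token.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    print(model.config.decoder_start_token_id)  # 0
    print(tokenizer.pad_token_id)               # 0, i.e. the pad token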

OK, so this is not the issue then. Did you also try running the forward pass under a torch.no_grad() context manager, and did you still get sensible outputs?
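
Something along these lines (just a sketch, reusing the names from your training loop):

    # Run the same forward pass with dropout off and gradients disabled,
    # then decode the argmax of the logits as before.
    model.eval()
    with torch.no_grad():
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=dec_input_ids,
            decoder_attention_mask=dec_attention_mask,
            labels=dec_input_ids,
        )
    print(tokenizer.batch_decode(outputs.logits.argmax(dim=-1), skip_special_tokens=True)[0])
    model.train()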

I happened to stumble across this post, which has a similar flavour to your error. Have a look!