T5 models have non-deterministic outputs even after disabling dropout

I observe that T5 models produce different forward outputs given the same input, even after disabling the dropout layers. I’m curious whether there is any other source of randomness in the forward pass besides dropout.

(P.S. The issue above occurs when the model is in train mode, i.e. after running model.train(). Naturally, if I run model.eval(), the outputs are all identical.)

The following script reproduces the issue:

from transformers import T5Tokenizer, T5ForConditionalGeneration
from trl.trainer.utils import disable_dropout_in_model
from datasets import load_dataset

tokenizer = T5Tokenizer.from_pretrained('google-t5/t5-base')
model = T5ForConditionalGeneration.from_pretrained('google-t5/t5-base')

# here I use xsum dataset for summarization
ds = load_dataset('EdinburghNLP/xsum', split='train')
prompt = "Summarize: " + ds[0]['document']

tokenized_dataset = tokenizer(prompt, truncation=True, padding='max_length', max_length=1024, return_tensors='pt')
source_ids = tokenized_dataset['input_ids']
source_mask = tokenized_dataset['attention_mask']

eos_token_id = [tokenizer.eos_token_id]

# open the train mode but disable the dropout
model.train()
disable_dropout_in_model(model)

# generate a random response
outputs = model.generate(input_ids=source_ids, 
                        attention_mask=source_mask, max_length=256, 
                        num_return_sequences=1, do_sample=True, eos_token_id=eos_token_id, temperature=1.0, num_beams=1, 
                        return_dict_in_generate=True,
                        output_scores=True)

# forward model twice with the same inputs
model_forward_1 = model(
    input_ids=source_ids,
    attention_mask=source_mask,
    labels=outputs.sequences,
    return_dict=True,
)
model_forward_2 = model(
    input_ids=source_ids,
    attention_mask=source_mask,
    labels=outputs.sequences,
    return_dict=True,
)

# print and compare the logits in two outputs, you will find they are different.
print(model_forward_1['logits'])
print(model_forward_2['logits'])

hi @jiaweihuang
Because the __call__ function (actually _call_impl) from nn.Module uses the random state each time it is called. Sorry, I couldn’t find the exact line, but here’s a reference:

Try this:

import torch

....

torch.manual_seed(0)

# forward model twice with the same inputs
model_forward_1 = model(
    input_ids=source_ids,
    attention_mask=source_mask,
    labels=outputs.sequences,
    return_dict=True,
)

torch.manual_seed(0)

model_forward_2 = model(
    input_ids=source_ids,
    attention_mask=source_mask,
    labels=outputs.sequences,
    return_dict=True,
)

# print and compare the logits from the two outputs; with the seed reset, they should match.
print(model_forward_1['logits'])
print(model_forward_2['logits'])

Hi,

this is because you’re passing do_sample=True to the generate() method, which makes decoding non-deterministic.

Refer to How to generate text: using different decoding methods for language generation with Transformers for an overview of the different decoding methods. Greedy decoding and beam search are examples of deterministic decoding methods.
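
For illustration, here’s a minimal sketch (assuming the model, source_ids, and source_mask from your script above, plus import torch) showing that greedy decoding gives identical sequences across repeated calls:

import torch

model.eval()  # make sure dropout is off as well

# do_sample=False means greedy decoding: the argmax token is taken at each step
greedy_1 = model.generate(input_ids=source_ids, attention_mask=source_mask,
                          max_length=256, do_sample=False, num_beams=1)
greedy_2 = model.generate(input_ids=source_ids, attention_mask=source_mask,
                          max_length=256, do_sample=False, num_beams=1)

assert torch.equal(greedy_1, greedy_2)  # identical token sequences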

Thanks for the response. But I’m still a bit confused about which part of model.forward has randomness. Given that I have disabled all the dropout layers, shouldn’t the output be the same even under different random seeds, since each step of the inference should then be deterministic?

Or maybe I missed something and there are some random mask layers in T5?
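
For what it’s worth, here is a small sanity check I can run (assuming the model and the disable_dropout_in_model call from my script above) to confirm that no dropout module is still active:

import torch.nn as nn

# after disable_dropout_in_model(model), every nn.Dropout should have p == 0,
# so this loop should print nothing
for name, module in model.named_modules():
    if isinstance(module, nn.Dropout) and module.p > 0:
        print(name, module.p)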

Hi, I do not think this is related to the randomness in generate(). I just use generate() to get a valid response.

What I compared is the difference between the outputs when I call model.forward twice with the same inputs (even though those inputs were returned by generate()).

Hi,

The from_pretrained method puts a model in evaluation mode by default, disabling things like dropout. So there’s no need to disable those yourself.

The randomness comes solely from the do_sample=True argument, which samples a random token at each time step of the generation. If you don’t pass this argument, greedy decoding is used, which takes the token with the highest probability at each time step.
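
As a quick sketch (reloading the same checkpoint), you can confirm the default mode via the training flag:

model = T5ForConditionalGeneration.from_pretrained('google-t5/t5-base')
print(model.training)  # False: from_pretrained returns the model in eval mode

model.train()
print(model.training)  # True: dropout layers are active again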

But we have the same issue if we run the forward pass with an arbitrary label tensor instead of outputs.sequences:

model_forward_1 = model(
    input_ids=source_ids,
    attention_mask=source_mask,
    labels=torch.tensor([[1,2,3]]),
    return_dict=True,
)
model_forward_2 = model(
    input_ids=source_ids,
    attention_mask=source_mask,
    labels=torch.tensor([[1,2,3]]),
    return_dict=True,
)

# print and compare the logits in two outputs, you will find they are different.
print(model_forward_1['logits'])
print(model_forward_2['logits'])

I’m not able to reproduce this. The following passes for me:

from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained('google-t5/t5-base')
model = T5ForConditionalGeneration.from_pretrained('google-t5/t5-base')

inputs = tokenizer("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")

# forward model twice with the same inputs
model_forward_1 = model(
    **inputs,
    labels=torch.tensor([[1,2,3]]),
)
model_forward_2 = model(
    **inputs,
    labels=torch.tensor([[1,2,3]]),
)

# compare the logits from the two forward passes; the assertion passes, i.e. they match
assert torch.allclose(model_forward_1['logits'], model_forward_2['logits'])

Hi, I guess I still don’t fully understand…

In my code, although I set do_sample=True, I only generate one output, which stays fixed while I run model.forward twice.
Besides, the logits returned by model.forward correspond to the (unnormalized) log-probabilities of the tokens, which should be fixed if I run inference with a fixed input.

So I do not think the randomness in the model.generate step can explain the different outputs in the later model.forward calls.

Indeed, even torch.equal returns True. I don’t know how I got two different (but close) results last time, sorry.
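
For reference, the strict check mentioned above is just a one-line sketch against the outputs of the earlier snippet:

# exact element-wise equality, stricter than torch.allclose
assert torch.equal(model_forward_1['logits'], model_forward_2['logits'])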