GPT2 Generated Output Always the Same?

I’m in the process of training a small GPT2 model on C source code. At the moment I’m trying to get a sense of what it has learned so far by getting it to generate some samples. However, every time I generate samples the output is exactly the same, even though I’m giving it a different seed (based on the current time) every time.

My code is:

#!/usr/bin/env python

import sys
import random
import numpy as np
import time
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers.trainer_utils import set_seed

SEED = int(time.time())
set_seed(SEED)

print("Loading tokenizer...")
tokenizer = GPT2Tokenizer.from_pretrained("./csrc_vocab",
        additional_special_tokens=["<s>","<pad>","</s>","<unk>","<mask>"],
        pad_token='<pad>', max_len=512)

print("Loading model...")
model = GPT2LMHeadModel.from_pretrained(sys.argv[1],
        pad_token_id=tokenizer.eos_token_id).to('cuda')

input_ids = tokenizer.encode("int ", return_tensors='pt').to('cuda')

print("Generating...")
gen_output = model.generate(
    input_ids,
    max_length=128,
    temperature=1.1,
    repetition_penalty=1.4,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(gen_output[0], skip_special_tokens=True))

How do I properly seed the RNG so that I can get different outputs? I’ve also tried manually seeding with random.seed(), np.random.seed(), and torch.manual_seed(), but the output is always the same.

Hi @moyix!

I believe the set_seed() function you are calling is meant for the random processes inside the Trainer class, which is used for training and fine-tuning HF models. So, naively, I would say that calling set_seed() to get different output from a plain GPT2 model won’t work.

Unfortunately, I can’t think of a way to do this. Here is an article by @patrickvonplaten about generating text with different decoder methods that might be useful. Otherwise, maybe @sgugger can provide some insight?

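Hi @moyix

You need to turn on sampling by passing do_sample=True to the generate() method. By default, generate() uses greedy decoding, which always picks the most likely next token. No random numbers are involved, so the output is the same on every run regardless of the seed. Note that temperature only takes effect once sampling is enabled.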
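In the script above, the fix amounts to adding do_sample=True to the model.generate(...) call. To illustrate why the seed makes no difference under greedy decoding, here is a minimal pure-Python sketch. It does not use transformers at all: the four-token vocabulary, the logit values, and the helper names (greedy_pick, sample_pick, decode) are made up for illustration, and the scores are held fixed at every step, whereas a real model recomputes them from the context.

```python
import math
import random

# Hypothetical next-token scores for a toy 4-token vocabulary (made-up numbers).
VOCAB = ["int", "char", "void", "float"]
LOGITS = [2.0, 1.5, 0.3, 0.1]


def greedy_pick(logits):
    """Greedy decoding: take the highest-scoring token. No RNG is consulted."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature, rng):
    """Sampling: draw a token from softmax(logits / temperature) using rng."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def decode(seed, steps, do_sample):
    """Pick `steps` tokens, either greedily or by sampling with a seeded RNG."""
    rng = random.Random(seed)
    picks = []
    for _ in range(steps):
        if do_sample:
            idx = sample_pick(LOGITS, 1.1, rng)
        else:
            idx = greedy_pick(LOGITS)  # the seed never enters this branch
        picks.append(VOCAB[idx])
    return picks


if __name__ == "__main__":
    for seed in (1, 2, 3):
        print(seed, "greedy :", decode(seed, 8, do_sample=False))
        print(seed, "sampled:", decode(seed, 8, do_sample=True))
```

The greedy rows come out identical for every seed, while the sampled rows depend on it. Reusing the same seed with sampling reproduces the same sequence, which is what seeding is useful for once do_sample=True is set.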

Aha! Can’t believe I missed that even after reading that article before and looking through the generate() docs. Thank you!