I’m in the process of training a small GPT-2 model on C source code. To get a sense of what it has learned so far, I’m having it generate some samples. However, the output is exactly the same on every run, even though I give it a different seed (based on the current time) each time.
My code is:
#!/usr/bin/env python
import sys
import random
import numpy as np
import time
import torch
from transformers import GPT2Tokenizer
from transformers import GPT2Model, GPT2Config, GPT2LMHeadModel
from transformers.trainer_utils import set_seed
# seed the python, numpy and torch RNGs from the current time so each run should differ
SEED = int(time.time())
set_seed(SEED)
print("Loading tokenizer...")
tokenizer = GPT2Tokenizer.from_pretrained("./csrc_vocab",
    additional_special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    pad_token='<pad>', max_len=512)
print("Loading model...")
model = GPT2LMHeadModel.from_pretrained(sys.argv[1],
    pad_token_id=tokenizer.eos_token_id).to('cuda')
input_ids = tokenizer.encode("int ", return_tensors='pt').to('cuda')
print("Generating...")
gen_output = model.generate(
    input_ids,
    max_length=128,
    temperature=1.1,
    repetition_penalty=1.4,
    early_stopping=True
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(gen_output[0], skip_special_tokens=True))
How do I properly seed the RNG so that I can get different outputs? I’ve also tried manually seeding with random.seed(), np.random.seed(), and torch.manual_seed(), but the output is always the same.
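For reference, the manual seeding I tried (in place of the set_seed() call above) was along these lines:

# manual seeding attempt -- roughly what I tried instead of set_seed()
SEED = int(time.time())
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)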