I have recreated model.greedy_search() in 2 different ways, with the main difference being the size of input_ids.
Model Initialization
import torch
import transformers
# USER CONFIGURATIONS
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "EleutherAI/gpt-neo-1.3B"
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True,
    bnb_4bit_compute_dtype = torch.bfloat16
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, cache_dir = "./Models/")
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, quantization_config = bnb_config, cache_dir = "./Models/")
Generation Configurations
# GENERATION INPUTS
num_gen_tokens = 20
prompt = "ewrcewkr oewrkcl ewrkewr\n"
input_ids = tokenizer(prompt, return_tensors = "pt").to(device).input_ids.squeeze() # batched = False
# CONFIRM PRE-TRAINED CONFIGURATIONS
model.generation_config.pad_token_id = model.generation_config.eos_token_id
assert tokenizer.bos_token_id == model.generation_config.bos_token_id
assert tokenizer.eos_token_id == model.generation_config.eos_token_id
assert not ((model.generation_config.eos_token_id is None) ^ (model.generation_config.pad_token_id is None))
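For reference, a quick check of the shapes fed into the variations below (Variation 1 uses the squeezed 1-D tensor, Variations 2 and 3 add the batch dimension back):
# Shape sanity check for the tensors used by the variations below
print(input_ids.shape)                     # torch.Size([num_input_tokens])
print(input_ids.unsqueeze(dim = 0).shape)  # torch.Size([1, num_input_tokens])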
Variation 1: My code, where input_ids.shape = Size(num_input_tokens)
output_ids = input_ids.clone().detach()
model.eval()
with torch.no_grad():
    for _ in range(num_gen_tokens):
        if model.generation_config.eos_token_id is None or output_ids[-1] != model.generation_config.eos_token_id:
            outputs = model(output_ids)
            next_token_logits = outputs.logits[-1]  # only consider the logits output based on the last token of the input
            next_tokens = next_token_logits.argmax(dim = -1).unsqueeze(dim = -1)
            output_ids = torch.cat((output_ids, next_tokens), dim = -1)
print(tokenizer.decode(output_ids))
Variation 2: My code, where input_ids.shape = Size(1, num_input_tokens)
output_ids = input_ids.clone().detach().unsqueeze(dim = 0)
model.eval()
with torch.no_grad():
    for _ in range(num_gen_tokens):
        if model.generation_config.eos_token_id is None or output_ids[:, -1] != model.generation_config.eos_token_id:
            outputs = model(output_ids)
            next_token_logits = outputs.logits[:, -1]  # only consider the logits output based on the last token of the input
            next_tokens = next_token_logits.argmax(dim = -1).unsqueeze(dim = -1)
            output_ids = torch.cat((output_ids, next_tokens), dim = -1)
print(tokenizer.decode(output_ids.squeeze()))
Variation 3: HuggingFace API, where input_ids.shape = Size(1, num_input_tokens)
output = model.greedy_search(
    input_ids.clone().detach().unsqueeze(dim = 0),
    stopping_criteria = transformers.StoppingCriteriaList(
        [transformers.MaxLengthCriteria(max_length = 20 + input_ids.size(dim = -1))]
    )
).squeeze()
print(tokenizer.decode(output))
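As a side note, my understanding is that the higher-level generate() API performs the same greedy decoding when sampling and beam search are disabled; a sketch is below for completeness, though only the 3 variations above are compared in the cases that follow:
# Assumed-equivalent call through the higher-level API: greedy decoding of num_gen_tokens new tokens
output = model.generate(
    input_ids.clone().detach().unsqueeze(dim = 0),
    do_sample = False,
    num_beams = 1,
    max_new_tokens = num_gen_tokens
).squeeze()
print(tokenizer.decode(output))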
In most cases, the generated tokens returned should be the same across all 3 methods. However, I found 2 cases that seem to violate this rule (note that only the prompt was changed; the rest of the variables remained the same):
Case 1: prompt = "ewrcewkr oewrkcl ewrkewr\n", where Variation 2 seems to be the odd one out
- Variation 1’s output
ewrcewkr oewrkcl ewrkewr
I am a very simple person. I love to read, watch movies, and play video games
- Variation 2’s output
ewrcewkr oewrkcl ewrkewr
I am a very simple person. I am very easy going and I like to be around people
- Variation 3’s output
ewrcewkr oewrkcl ewrkewr
I am a very simple person. I love to read, watch movies, and play video games
Case 2: prompt = tokenizer.bos_token + "ewrcewkr oewrkcl ewrkewr\n", where Variation 3 seems to be the odd one out
- Variation 1’s output
<|endoftext|>ewrcewkr oewrkcl ewrkewr
wewrcewkr oewrkcl ewrkewr
(a
- Variation 2’s output
<|endoftext|>ewrcewkr oewrkcl ewrkewr
wewrcewkr oewrkcl ewrkewr
(a
- Variation 3’s output
<|endoftext|>ewrcewkr oewrkcl ewrkewr
The following is a list of the most common words in the English language.
The most
I have 3 questions regarding the difference in outputs (as seen above):
- What should be the expected input shape passed into model.forward()? Is it Size(1, num_input_tokens) or Size(num_input_tokens)? If the input is of Size(1, num_input_tokens), outputs.logits has shape torch.Size([1, num_output_tokens, num_tokenizer_tokens]). If the input is of Size(num_input_tokens), outputs.logits has shape torch.Size([num_output_tokens ** 2, num_tokenizer_tokens]). (A small shape probe is included after these questions.)
- Does my code correctly model how the LLM decodes the output via greedy search?
- What is causing this difference in decoded output across all 3 methods used?
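For reference, here is the minimal probe mentioned in the first question, comparing the logits shapes produced by the two input shapes (reusing the model and input_ids defined above):
# Compare the logits shapes produced by the two input shapes from question 1
with torch.no_grad():
    logits_1d = model(input_ids).logits                     # input of Size(num_input_tokens)
    logits_2d = model(input_ids.unsqueeze(dim = 0)).logits  # input of Size(1, num_input_tokens)
print(logits_1d.shape, logits_2d.shape)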
Thank you in advance.