How does `tokenizer().input_ids` work, and how does it differ from `tokenizer.encode()` before `model.generate()` and the decoding step?

How does `tokenizer().input_ids` work before it gets decoded? I was reading examples 1 and 2 and I see they use

# encode input context
input_ids = tokenizer(input_context, return_tensors="pt").input_ids
# generate sequences without allowing bad_words to be generated
# return_dict_in_generate=True is needed so that outputs["sequences"] below is valid
outputs = model.generate(input_ids=input_ids, max_length=20, do_sample=True, bad_words_ids=bad_words_ids, return_dict_in_generate=True)
print("Generated:", tokenizer.decode(outputs["sequences"][0], skip_special_tokens=True))

In the above example, does any encoding take place? If yes, where?

I saw some other example (I can't find the link now) that used `tokenizer.encode()`, something like this:

# encode input context
input_ids = tokenizer.encode(input_context, return_tensors="pt")
# generate sequences without allowing bad_words to be generated
outputs = model.generate(input_ids=input_ids, max_length=20, do_sample=True, bad_words_ids=bad_words_ids)
print("Generated:", tokenizer.decode(outputs[0], skip_special_tokens=True))

What is the difference between the two methods?


You can execute both versions of the code and see that they produce the same result.

`tokenizer()` will also return the attention mask, which is why selecting `input_ids` is necessary to get the exact equivalent of `tokenizer.encode()`.
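
In both snippets the actual encoding (text to token IDs) happens in the very first line, i.e. in the `tokenizer(...)` or `tokenizer.encode(...)` call, before `model.generate()` is ever invoked. Here is a minimal sketch to confirm the equivalence; "gpt2" and the sample text are just assumptions for illustration, any checkpoint behaves the same way:

import torch
from transformers import AutoTokenizer

# "gpt2" is only an example checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_context = "The dog"

# tokenizer() returns a BatchEncoding holding input_ids plus the attention mask
encoding = tokenizer(input_context, return_tensors="pt")
print(list(encoding.keys()))  # ['input_ids', 'attention_mask']

# tokenizer.encode() returns only the token IDs
ids = tokenizer.encode(input_context, return_tensors="pt")

# the IDs are identical, so generate() sees exactly the same input either way
print(torch.equal(encoding.input_ids, ids))  # True

So the only practical difference is that `tokenizer()` also gives you the attention mask, which you would want to pass along to `generate()` when your inputs are padded.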