How does tokenizer(...).input_ids work before it gets decoded? I was reading examples 1 and 2 and saw that they use:
# encode input context
input_ids = tokenizer(input_context, return_tensors="pt").input_ids
# generate sequences without allowing bad_words to be generated
outputs = model.generate(input_ids=input_ids, max_length=20, do_sample=True, bad_words_ids=bad_words_ids, return_dict_in_generate=True)
print("Generated:", tokenizer.decode(outputs["sequences"][0], skip_special_tokens=True))
In the above example, does any encoding take place? If yes, where?
I saw another example (I can't find the link now) that used tokenizer.encode(), something like this:
# encode input context
input_ids = tokenizer.encode(input_context, return_tensors="pt")
# generate sequences without allowing bad_words to be generated
outputs = model.generate(input_ids=input_ids, max_length=20, do_sample=True, bad_words_ids=bad_words_ids)
print("Generated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
What is the difference between the two methods?
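To check my understanding, I put together a toy sketch below. This is NOT the real Hugging Face implementation (the class and vocab here are made up); it just shows the relationship I think holds: calling the tokenizer object returns a dict-like result whose "input_ids" entry contains the same ids that encode() returns directly.

```python
# Toy sketch (not the real Hugging Face code): __call__ wraps encode()
# and returns a dict with extra fields, while encode() returns just the ids.
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab  # word -> id mapping

    def encode(self, text):
        # This is where encoding happens: text is split into tokens
        # and each token is mapped to its integer id.
        return [self.vocab[w] for w in text.split()]

    def __call__(self, text):
        # Calls encode() internally and wraps the ids in a dict,
        # alongside other model inputs like an attention mask.
        ids = self.encode(text)
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}

    def decode(self, ids):
        # Inverse mapping: ids back to text.
        inv = {i: w for w, i in self.vocab.items()}
        return " ".join(inv[i] for i in ids)


tok = ToyTokenizer({"hello": 0, "world": 1})
assert tok("hello world")["input_ids"] == tok.encode("hello world")
print(tok.decode(tok.encode("hello world")))  # -> hello world
```

If this mental model is right, tokenizer(input_context, return_tensors="pt").input_ids and tokenizer.encode(input_context, return_tensors="pt") should produce the same tensor of ids, and the difference is only in what extra fields come back with them. Is that correct?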