How does `tokenizer().input_ids` work, and how does it differ from `tokenizer.encode()` before `model.generate()` and the decoding step?

How does `tokenizer().input_ids` work before it gets decoded? I was reading examples 1 and 2 and I see they use

# encode input context
input_ids = tokenizer(input_context, return_tensors="pt").input_ids
# generate sequences without allowing bad_words to be generated
# return_dict_in_generate=True is needed so that outputs["sequences"] below is valid
outputs = model.generate(input_ids=input_ids, max_length=20, do_sample=True, bad_words_ids=bad_words_ids, return_dict_in_generate=True)
print("Generated:", tokenizer.decode(outputs["sequences"][0], skip_special_tokens=True))

In the above example, does any encoding take place? If yes, where?

I saw some other example (I can't find the link now) that used `tokenizer.encode()`, something like this:

# encode input context
input_ids = tokenizer.encode(input_context, return_tensors="pt")
# generate sequences without allowing bad_words to be generated
outputs = model.generate(input_ids=input_ids, max_length=20, do_sample=True, bad_words_ids=bad_words_ids)
print("Generated:", tokenizer.decode(outputs[0], skip_special_tokens=True))

What is the difference between the two methods?


You can execute both versions of the code and see that they produce the same result.

`tokenizer()` will also return the attention mask, which is why selecting `input_ids` is necessary to get the exact equivalent of `tokenizer.encode()`.
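
In both snippets the actual encoding (text to token IDs) happens in the very first line, i.e. in the `tokenizer(...)` or `tokenizer.encode(...)` call, before `model.generate()` is ever invoked. Here is a minimal sketch to confirm the equivalence; "gpt2" and the sample text are just assumptions for illustration, any checkpoint behaves the same way:

import torch
from transformers import AutoTokenizer

# "gpt2" is only an example checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_context = "The dog"

# tokenizer() returns a BatchEncoding holding input_ids plus the attention mask
encoding = tokenizer(input_context, return_tensors="pt")
print(list(encoding.keys()))  # ['input_ids', 'attention_mask']

# tokenizer.encode() returns only the token IDs
ids = tokenizer.encode(input_context, return_tensors="pt")

# the IDs are identical, so generate() sees exactly the same input either way
print(torch.equal(encoding.input_ids, ids))  # True

So the only practical difference is that `tokenizer()` also gives you the attention mask, which you would want to pass along to `generate()` when your inputs are padded.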