Get vocabulary token ids in order to exclude them from the generate function

I want to get the vocabulary ids of some phrases in order to exclude these ids from text generation with GPT-2.

I use AutoConfig and AutoTokenizer, and when I try to get the ids that I want to exclude with

`tokenizer(bad_word, add_prefix_space=True).input_ids`

as described for the `bad_words_ids` argument of the `generate` function (Transformers 4.4.2 documentation), I get the error:

`_batch_encode_plus() got an unexpected keyword argument 'add_prefix_space'`

Do I have to use this argument, and why is this error thrown?
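
Minimal repro, assuming the stock gpt2 checkpoint (the model name here is just an example):

```python
from transformers import AutoTokenizer

# AutoTokenizer loads the fast (Rust-backed) tokenizer for GPT-2 by default
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# This is the call suggested in the bad_words_ids docs, but it raises:
# TypeError: _batch_encode_plus() got an unexpected keyword argument 'add_prefix_space'
ids = tokenizer("some phrase", add_prefix_space=True).input_ids
```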


I’m running into a similar situation. It seems that the pre-trained GPT-2 checkpoints load `PreTrainedTokenizerFast` rather than the regular `GPT2Tokenizer`, and `PreTrainedTokenizerFast` does not accept the `add_prefix_space` argument at call time.

Does this mean that it’s not possible to tokenize a `bad_words_ids` list with the pretrained GPT-2 models? I’m a little lost myself!
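
For what it’s worth, the exact call from the docs does seem to work if you load the slow tokenizer explicitly (a sketch, assuming the stock gpt2 checkpoint and placeholder phrases):

```python
from transformers import GPT2Tokenizer

# The slow (pure-Python) tokenizer accepts add_prefix_space per call,
# matching the example given in the generate() docs
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# bad_words_ids expects a list of token-id lists, one per phrase
bad_words_ids = [
    tokenizer(phrase, add_prefix_space=True).input_ids
    for phrase in ["some phrase", "another phrase"]
]
```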

Hey,
instead of passing `add_prefix_space` when you call the tokenizer, you can pass it when you instantiate the tokenizer itself. See AutoTokenizer _batch_encode_plus method don't have add_prefix_space argument · Issue #17391 · huggingface/transformers · GitHub
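
For example, a minimal sketch along those lines, assuming the stock gpt2 checkpoint and an illustrative bad phrase:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Setting add_prefix_space at instantiation configures the fast tokenizer
# once, instead of per call (which the fast tokenizer rejects)
tokenizer = AutoTokenizer.from_pretrained("gpt2", add_prefix_space=True)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# bad_words_ids expects a list of token-id lists
bad_words_ids = [tokenizer(phrase).input_ids for phrase in ["some phrase"]]

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output = model.generate(**inputs, bad_words_ids=bad_words_ids, max_length=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```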