Get vocabulary token ids in order to exclude them from the generate function

I want to get the vocabulary ids of some phrases in order to exclude these ids from text generation with GPT-2.

I use AutoConfig and AutoTokenizer, and when I try to get the ids that I want to exclude with

`tokenizer(bad_word, add_prefix_space=True).input_ids`

as described for the `bad_words_ids` argument of the `generate` function (Transformers 4.4.2 documentation), I get the error:

`_batch_encode_plus() got an unexpected keyword argument 'add_prefix_space'`

Do I have to use this argument, and why is this error thrown?
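
Minimal repro, assuming the stock gpt2 checkpoint (the model name here is just an example):

```python
from transformers import AutoTokenizer

# AutoTokenizer loads the fast (Rust-backed) tokenizer for GPT-2 by default
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# This is the call suggested in the bad_words_ids docs, but it raises:
# TypeError: _batch_encode_plus() got an unexpected keyword argument 'add_prefix_space'
ids = tokenizer("some phrase", add_prefix_space=True).input_ids
```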


I’m running into a similar situation. It seems that the pre-trained GPT-2 checkpoints load `PreTrainedTokenizerFast` rather than the regular `GPT2Tokenizer`, and `PreTrainedTokenizerFast` does not accept the `add_prefix_space` argument at call time.

Does this mean that it’s not possible to tokenize a `bad_words_ids` list with the pretrained GPT-2 models? I’m a little lost myself!
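
For what it’s worth, the exact call from the docs does seem to work if you load the slow tokenizer explicitly (a sketch, assuming the stock gpt2 checkpoint and placeholder phrases):

```python
from transformers import GPT2Tokenizer

# The slow (pure-Python) tokenizer accepts add_prefix_space per call,
# matching the example given in the generate() docs
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# bad_words_ids expects a list of token-id lists, one per phrase
bad_words_ids = [
    tokenizer(phrase, add_prefix_space=True).input_ids
    for phrase in ["some phrase", "another phrase"]
]
```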

Hey,
instead of passing `add_prefix_space` when you call the tokenizer, you can pass it when you instantiate the tokenizer itself. See AutoTokenizer _batch_encode_plus method don't have add_prefix_space argument · Issue #17391 · huggingface/transformers · GitHub
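
For example, a minimal sketch along those lines, assuming the stock gpt2 checkpoint and an illustrative bad phrase:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Setting add_prefix_space at instantiation configures the fast tokenizer
# once, instead of per call (which the fast tokenizer rejects)
tokenizer = AutoTokenizer.from_pretrained("gpt2", add_prefix_space=True)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# bad_words_ids expects a list of token-id lists
bad_words_ids = [tokenizer(phrase).input_ids for phrase in ["some phrase"]]

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output = model.generate(**inputs, bad_words_ids=bad_words_ids, max_length=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```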