Remove only certain special token ids during tokenizer decode

I am using a GPT-2-based language model to generate text. My training data contains special tokens, so I want the model to generate those special tokens as well. The model's generated text contains a lot of padding tokens, and I was wondering whether there is a way to remove only those during decoding. One way to solve it would be to pass the decoded text through a regular expression or filter and remove all the padding tokens.
Here is an example of the generated text I got after decoding it. I removed a lot of the pad tokens for this forum post.

'<|begincontext|><|user|>What can I go do I am bored sitting here.<|system|>What do you like to do and where would you like to do it?<|user|>I want something to do in or around NYC and Musical shows are one of my favorites.<|system|>I pulled up a list of 10 so far I think this one would interest you its on March 11th at 7:30 pm at the Kaufmann Concert Hall - Abbi Jacobson will be there.<|user|>That does interest me. Is there a direct bus I can take to get there?<|endcontext|>
\n\n<|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|><|pad|>FindAttractions<|endintent|>\n\n<|beginbelief|>Travel_goodForKids: True Travel_location: NYC<|endaction|>\n\n<|beginaction|>OFFER Travel_attractionName, OFFER Travel_category<|endintent|>\n\n<|beginresponse|>The Ac 9 Hotel By Marriott Brooklyn Bridge is a fine tourist spot.<|endintent|>\n\n<|pad|><|pad|>OFFER_attractionName, OFFER Travel_category<|endresponse|><|pad|><|pad|><|pad|><|pad|>'
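For reference, the regular expression/filter idea mentioned above could look roughly like the sketch below. It assumes the pad token is the literal string `<|pad|>` seen in the sample output; the helper name `strip_pad` is made up for illustration and is not part of any library.

```python
import re

# Assumed pad token string, taken from the sample output above.
PAD_TOKEN = "<|pad|>"


def strip_pad(text: str, pad_token: str = PAD_TOKEN) -> str:
    """Remove every occurrence of pad_token and keep all other special tokens."""
    # re.escape is needed because "|" is a regex metacharacter.
    return re.sub(re.escape(pad_token), "", text)


sample = "<|pad|><|pad|>FindAttractions<|endintent|><|pad|>"
cleaned = strip_pad(sample)
print(cleaned)
```

Because this works on the decoded string, the other special tokens such as `<|endintent|>` are left untouched.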

Hi, did you manage to find a proper solution for this, other than iteratively removing the unwanted special tokens by hand?

Unfortunately, I was not able to find a proper solution to this.
Initially I used regular expressions to remove certain strings, but later I decided to tokenize the text and use a list comprehension to remove tokens by their ids.
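A minimal sketch of the list comprehension approach, not the exact code I used: the ids below are made-up placeholders, and `drop_token_ids` is a hypothetical helper. With a Hugging Face tokenizer, the pad id would normally come from `tokenizer.pad_token_id`.

```python
from typing import Iterable, List


def drop_token_ids(token_ids: Iterable[int], ids_to_drop: Iterable[int]) -> List[int]:
    """Return token_ids without any id listed in ids_to_drop, preserving order."""
    drop = set(ids_to_drop)  # set lookup keeps the filter linear in sequence length
    return [t for t in token_ids if t not in drop]


# Placeholder ids for illustration only; real values depend on the tokenizer.
PAD_ID = 0
generated_ids = [0, 0, 17, 42, 0, 99, 0]

kept_ids = drop_token_ids(generated_ids, [PAD_ID])
print(kept_ids)

# With a real tokenizer, the filtered ids could then be decoded while keeping
# the remaining special tokens, for example:
# text = tokenizer.decode(kept_ids, skip_special_tokens=False)
```

Filtering by id before decoding avoids string matching altogether, and more ids can be added to the drop list if other tokens also need to go.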

Could you paste your code? It might provide more detail.