Issue with Decoding in HuggingFace

ashutoshsaboo · March 14, 2022, 5:17pm

Hello! Is there a way to call batch_decode on a minibatch of tokenized text samples and get back the original input text, with sentence1 and sentence2 kept separate? Currently, batch_decode returns the required text, but by default it includes a lot of special tokens (PAD, CLS, SEP, etc.). I know the skip_special_tokens param (Utilities for Tokenizers) can remove these unwanted tokens, but a by-product is that the SEP token is removed too. As a result, the special-token-free text gives no way to split the output into the decoded sentence1 and sentence2: the two are simply concatenated.

Is there some way to strip the other unwanted tokens (PAD, CLS, etc.) but keep SEP in the batch_decode output, or is there an alternative method already available for this use case, so that we can get the decoded sentence1 and sentence2 back separately? Can someone please help?

@lewtun: I came across many of your insightful posts and answers in the community. If you could help out with the above, that’d be so helpful and awesome!
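
For reference, one possible workaround for the question above is to split each row of token IDs at the SEP token ID first, and only then decode every segment with skip_special_tokens=True. The sketch below is a minimal illustration, not an official API: decode_pairs is a hypothetical helper name, and it assumes the tokenizer exposes sep_token_id and decode(..., skip_special_tokens=True), as Hugging Face tokenizers for BERT-style models do. It also assumes SEP is what marks the boundary between sentence1 and sentence2.

```python
from typing import List, Sequence


def decode_pairs(tokenizer, batch_ids: Sequence[Sequence[int]]) -> List[List[str]]:
    """Hypothetical helper: decode each row into one string per SEP-delimited segment.

    Assumes `tokenizer` has a `sep_token_id` attribute and a
    `decode(ids, skip_special_tokens=True)` method.
    """
    sep_id = tokenizer.sep_token_id
    decoded = []
    for row in batch_ids:
        segments, current = [], []
        for token_id in row:
            token_id = int(token_id)  # also accepts tensor or numpy scalars
            if token_id == sep_id:
                # SEP closes the current segment and is itself dropped.
                segments.append(current)
                current = []
            else:
                current.append(token_id)
        if current:
            # Tokens after the last SEP (typically only PAD).
            segments.append(current)
        # CLS and PAD are stripped here, after the split has been made.
        texts = [tokenizer.decode(seg, skip_special_tokens=True) for seg in segments]
        # Drop segments that decode to nothing (padding, doubled separators).
        decoded.append([text for text in texts if text])
    return decoded


# Example usage (assumes the `transformers` package and a BERT-style checkpoint):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("bert-base-uncased")
#   enc = tok(["first sentence"], ["second sentence"], padding="max_length", max_length=16)
#   pairs = decode_pairs(tok, enc["input_ids"])
```

Two caveats with this sketch: a genuinely empty sentence would be filtered out along with the padding, and the decoded text may not match the original input exactly (for example, uncased models lowercase it).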
