Issue with Decoding in HuggingFace

Hello! Is there a way to use batch_decode on a minibatch of tokenized text samples to recover the original input text, but with sentence1 and sentence2 kept separate? What I mean is: batch_decode currently returns the required text, but littered with special tokens by default ([PAD], [CLS], [SEP], etc.). I know there is the skip_special_tokens parameter (see Utilities for Tokenizers), which removes these unwanted tokens, but an unfortunate by-product is that the [SEP] token gets removed as well. That means there is no marker left in the returned text to split on, so the decoded sentence1 and sentence2 come back concatenated into one string.
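
To illustrate (a toy sketch, assuming bert-base-cased; the exact wordpieces and padding length will vary with the tokenizer and settings):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
enc = tokenizer("first sentence", "second sentence", padding="max_length", max_length=12)

tokenizer.decode(enc["input_ids"])
# '[CLS] first sentence [SEP] second sentence [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]'

tokenizer.decode(enc["input_ids"], skip_special_tokens=True)
# 'first sentence second sentence'  <- the [SEP] boundary is gone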

Is there some way to strip the other unwanted tokens ([PAD], [CLS], etc.) in batch_decode but leave [SEP] in place (or is there an alternative method already available for this use case?), so that we can get the decoded sentence1 and sentence2 back separately? Can someone please help if possible?

@lewtun: I came across many of your insightful posts/answers in the community. If you could please help out with the above, that'd be so helpful and awesome! :smile:

Hey @ashutoshsaboo, I'm not aware of a built-in method to achieve what you want, but can't you just slice off the first and last tokens from the output of batch_decode()?

Here’s a simple example to show what I mean:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
inputs = tokenizer("My name is Alice", "My name is Bob")
outputs = tokenizer.batch_decode(inputs["input_ids"])
# Returns ['[CLS]', 'My', 'name', 'is', 'Alice', '[SEP]', 'My', 'name', 'is', 'Bob', '[SEP]']
outputs[1:-1]  # drops [CLS] and the final [SEP], keeps the middle [SEP]
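
To get the two sentences back separately, you could then split at the [SEP] that remains in the middle (a rough sketch; " ".join is a crude detokenization and won't rejoin wordpieces):

trimmed = outputs[1:-1]                       # drop [CLS] and the final [SEP]
boundary = trimmed.index("[SEP]")             # the remaining [SEP] marks the sentence boundary
sentence1 = " ".join(trimmed[:boundary])      # 'My name is Alice'
sentence2 = " ".join(trimmed[boundary + 1:])  # 'My name is Bob'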

If that’s not what you’re after, perhaps you can share an example of the inputs and the desired outputs?

Slicing off the first and last tokens takes care only of [CLS] and the final [SEP] (if it exists). What about the [PAD] tokens, which can be added dynamically when padding a batch? Is there an easy native way to strip those too? @lewtun

Of course, you could do it with Python string operations (stripping the [PAD] tokens may also leave redundant spaces that you then have to strip again); I'm currently using re for exactly that, as sketched below. But I initially thought this would be a fairly common use case that would ideally have a native method. Alas, unfortunately not!
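
For reference, here is roughly what that re-based cleanup can look like (a sketch, assuming a BERT-style tokenizer and a padded batch in batch["input_ids"]; the [CLS]/[PAD]/[SEP] strings differ for other tokenizers):

import re

decoded = tokenizer.batch_decode(batch["input_ids"])
pairs = []
for text in decoded:
    # strip [CLS] and [PAD] but keep [SEP] as the split point
    text = re.sub(r"\[(?:CLS|PAD)\]", "", text)
    # the trailing [SEP] leaves an empty third piece, so keep the first two
    first, second = text.split("[SEP]")[:2]
    pairs.append((first.strip(), second.strip()))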

Ideally, if a filter of tokens to strip could be passed to batch_decode, that would do the job.
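
In the meantime, that filter is easy enough to approximate by hand at the token-id level. A minimal sketch (decode_pair_batch is my own helper, not a transformers API, and it assumes input_ids_batch is a list of Python lists; call .tolist() on tensors first): drop every special-token id except sep_token_id before decoding, so the sentence boundary survives:

def decode_pair_batch(tokenizer, input_ids_batch):
    # all special-token ids except [SEP], which we keep as the boundary marker
    strip_ids = set(tokenizer.all_special_ids) - {tokenizer.sep_token_id}
    pairs = []
    for ids in input_ids_batch:
        kept = [i for i in ids if i not in strip_ids]
        text = tokenizer.decode(kept)
        # the trailing [SEP] leaves an empty tail after the split
        first, second = text.split(tokenizer.sep_token)[:2]
        pairs.append((first.strip(), second.strip()))
    return pairs

pairs = decode_pair_batch(tokenizer, batch["input_ids"])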