Issue with Decoding in HuggingFace

Hello! Is there a way to use batch_decode on a minibatch of tokenized text samples to recover the original input text, but with sentence1 and sentence2 kept separate? What I mean is: batch_decode currently returns the required text, but littered with special tokens by default ([PAD], [CLS], [SEP], etc.). I know there is the skip_special_tokens parameter (see Utilities for Tokenizers), which removes these unwanted tokens, but an unfortunate by-product is that the [SEP] token gets removed as well. That means there is no marker left in the returned text to split on, so the decoded sentence1 and sentence2 come back concatenated into one string.
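
To illustrate (a toy sketch, assuming bert-base-cased; the exact wordpieces and padding length will vary with the tokenizer and settings):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
enc = tokenizer("first sentence", "second sentence", padding="max_length", max_length=12)

tokenizer.decode(enc["input_ids"])
# '[CLS] first sentence [SEP] second sentence [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]'

tokenizer.decode(enc["input_ids"], skip_special_tokens=True)
# 'first sentence second sentence'  <- the [SEP] boundary is gone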

Is there some way to strip the other unwanted tokens ([PAD], [CLS], etc.) in batch_decode but leave [SEP] in place (or is there an alternative method already available for this use case?), so that we can get the decoded sentence1 and sentence2 back separately? Can someone please help if possible?

@lewtun: I came across many of your insightful posts/answers in the community. If you could please help out with the above, that'd be so helpful and awesome! :smile:

Hey @ashutoshsaboo, I'm not aware of a built-in method to achieve what you want, but can't you just slice off the first and last tokens from the output of batch_decode()?

Here’s a simple example to show what I mean:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
inputs = tokenizer("My name is Alice", "My name is Bob")
outputs = tokenizer.batch_decode(inputs["input_ids"])
# Returns ['[CLS]', 'My', 'name', 'is', 'Alice', '[SEP]', 'My', 'name', 'is', 'Bob', '[SEP]']
outputs[1:-1]  # drops [CLS] and the final [SEP], keeps the middle [SEP]
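
To get the two sentences back separately, you could then split at the [SEP] that remains in the middle (a rough sketch; " ".join is a crude detokenization and won't rejoin wordpieces):

trimmed = outputs[1:-1]                       # drop [CLS] and the final [SEP]
boundary = trimmed.index("[SEP]")             # the remaining [SEP] marks the sentence boundary
sentence1 = " ".join(trimmed[:boundary])      # 'My name is Alice'
sentence2 = " ".join(trimmed[boundary + 1:])  # 'My name is Bob'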

If that’s not what you’re after, perhaps you can share an example of the inputs and the desired outputs?

Slicing off the first and last tokens takes care only of [CLS] and the final [SEP] (if it exists). What about the [PAD] tokens, which can be added dynamically when padding a batch? Is there an easy native way to strip those too? @lewtun

Of course, you could do it with Python string operations (stripping the [PAD] tokens may also leave redundant spaces that you then have to strip again); I'm currently using re for exactly that, as sketched below. But I initially thought this would be a fairly common use case that would ideally have a native method. Alas, unfortunately not!
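
For reference, here is roughly what that re-based cleanup can look like (a sketch, assuming a BERT-style tokenizer and a padded batch in batch["input_ids"]; the [CLS]/[PAD]/[SEP] strings differ for other tokenizers):

import re

decoded = tokenizer.batch_decode(batch["input_ids"])
pairs = []
for text in decoded:
    # strip [CLS] and [PAD] but keep [SEP] as the split point
    text = re.sub(r"\[(?:CLS|PAD)\]", "", text)
    # the trailing [SEP] leaves an empty third piece, so keep the first two
    first, second = text.split("[SEP]")[:2]
    pairs.append((first.strip(), second.strip()))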

Ideally, if a filter of tokens to strip could be passed to batch_decode, that would do the job.
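
In the meantime, that filter is easy enough to approximate by hand at the token-id level. A minimal sketch (decode_pair_batch is my own helper, not a transformers API, and it assumes input_ids_batch is a list of Python lists; call .tolist() on tensors first): drop every special-token id except sep_token_id before decoding, so the sentence boundary survives:

def decode_pair_batch(tokenizer, input_ids_batch):
    # all special-token ids except [SEP], which we keep as the boundary marker
    strip_ids = set(tokenizer.all_special_ids) - {tokenizer.sep_token_id}
    pairs = []
    for ids in input_ids_batch:
        kept = [i for i in ids if i not in strip_ids]
        text = tokenizer.decode(kept)
        # the trailing [SEP] leaves an empty tail after the split
        first, second = text.split(tokenizer.sep_token)[:2]
        pairs.append((first.strip(), second.strip()))
    return pairs

pairs = decode_pair_batch(tokenizer, batch["input_ids"])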