Hello! Is there a way to batch_decode on a minibatch of tokenized text samples to get the actual input text, but with sentence1 and sentence2 as separated? What I mean is that: currently batch_decode returns the required text but with a whole lot of special tokens by default (PAD, CLS, SEP etc etc). I know there is the skip_special_tokens param (Utilities for Tokenizers) which can help remove these unwanted tokens, but unfortunately a by-product of that is also that the special SEP token is also removed - which means in the returned special token free text there’s no way to split and get decoded sentence1 and sentence2 as separate sentences and both are concatenated.
Is there some way to clear these other unwanted tokens (PAD, CLS etc) but leave SEP in the batch_decode (or if there’s any alternative method already available for this use-case?) - so we can get the decoded sentence1 and sentence2 separately back? Can someone please help if possible?
@lewtun: I came across many of your insightful posts/answers in the community. If you could please help out with the above if possible, that’d be so helpful and awesome!