Variable length batch decoding

Hi All,

Just want to know, is there any way to batch-decode variable-length sentences?

For example, [S1, S2], where S1 has 5 words and S2 has 10 words. Can we decode them using GPT-2, BERT, etc.?

Hi @s4sarath

What do you mean by decoding: decoding the tokens generated by GPT-2, or making predictions on a batch of sequences?

By decoding I mean generating a sequence of tokens, @valhalla.

All tokenizers offer this functionality; just pass the list of sequences to the tokenizer:

tokens = tokenizer([s1, s2], padding=True)["input_ids"]

With padding=True it will pad all the sequences to the maximum length in the batch if they are of different lengths. You can find more detailed info in this guide.
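For example, a minimal sketch (the model name and sentences are just placeholders; it assumes a tokenizer that already ships a pad token, such as BERT's):

from transformers import AutoTokenizer

# assumes a checkpoint whose tokenizer already has a pad token (BERT does, GPT-2 does not)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

s1 = "A short sentence."
s2 = "A somewhat longer sentence with a few more words in it."

# padding=True pads every sequence to the longest one in the batch;
# attention_mask marks real tokens (1) versus padding (0)
batch = tokenizer([s1, s2], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)    # (2, length_of_longest_sequence)
print(batch["attention_mask"])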

@valhalla thanks for this. I have seen that. But when I tried the pad token with GPT-2 it didn't work as expected.

There is no pad token for GPT-2; you can manually set the EOS token as the pad token:

tokenizer.pad_token = tokenizer.eos_token
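Put together, a minimal sketch for GPT-2 (the sentences are just placeholders):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships no pad token, so reuse EOS

# with a pad token set, variable-length sentences can be padded to a common length
batch = tokenizer(
    ["A short sentence.", "A somewhat longer sentence than the first one."],
    padding=True,
    return_tensors="pt",
)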

Hi @valhalla. Thanks for the suggestion. I have done as you say; it is the same thing I tried before. But as you can see in the screenshot, decoding a batch of variable-length sentences does not produce correct results.

As this is an autoregressive model that predicts the next token based on the previous tokens, it might not generate correct tokens when there are EOS (padding) tokens in the input.

I thought you were asking about batching at training time. Sorry about the misleading answer.

Right now generate does not support batched generation for GPT-2.

Pinging @lysandre
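For reference, in more recent transformers versions batched generation with GPT-2 does work if the batch is padded on the left (so no EOS/padding tokens end up inside the autoregressive context) and the attention mask is passed to generate. A rough sketch, assuming such a version (prompts and lengths are just placeholders):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # keep padding out of the autoregressive context

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompts = ["Hello, my name is", "The weather today is"]
batch = tokenizer(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],  # tells generate which positions are padding
        max_length=30,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.batch_decode(out, skip_special_tokens=True))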


No problem @valhalla, I appreciate your response.
I have implemented this feature locally.

Off topic, but I would like to know what your thoughts are on this.

Not a TF user 🙂

Great, can you share your fix if possible? Lots of other people are interested in batched prediction for GPT-2.


I have implemented it in TF 2.0. I had to make quite a few changes to make it work. I will share it.