Hi All,
Just want to know, is there any way to batch decode variable-length sentences?
For example [S1, S2], where S1 has 5 words and S2 has 10 words. Can we decode it using GPT2, BERT, etc.?
Hi @s4sarath
What do you mean by decoding: decoding the tokens generated by GPT2, or making predictions on a batch of sequences?
By decoding I mean generating a sequence of tokens, @valhalla.
All tokenizers offer this functionality, just pass the list of sequences to it:
tokens = tokenizer([s1, s2], padding=True)["input_ids"]
With padding=True it will pad all the sequences to the longest one in the batch if they are of different lengths. You can find more detailed info in this guide.
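For example (a minimal sketch, assuming the transformers AutoTokenizer and a BERT checkpoint, which already defines a pad token):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

s1 = "A short sentence."
s2 = "A noticeably longer sentence with quite a few more words in it."

# padding=True pads every sequence to the longest one in the batch
batch = tokenizer([s1, s2], padding=True)
print(batch["input_ids"])       # the shorter sequence is filled with the pad token id
print(batch["attention_mask"])  # 1 for real tokens, 0 for padding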
@valhalla thanks for this. I have seen that. But when I tried padding with GPT2, it didn't work as expected.
There is no pad token for GPT; you can manually set the EOS token as the pad token using
tokenizer.pad_token = tokenizer.eos_token
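In context (a sketch, assuming the transformers GPT2Tokenizer):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT2 ships without a pad token, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(["a short prompt", "a somewhat longer prompt with more words"], padding=True)
# the shorter prompt is now padded with the EOS token id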
Hi @valhalla. Thanks for the suggestion. I have done as you say; this is the same thing I tried before. But as you can see in the screenshot, batching variable-length sentences does not produce correct results.
As this is an autoregressive model that predicts the next token based on the previous tokens, it might not generate correct tokens when there are EOS tokens in the text.
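A minimal sketch of the usual left-padding workaround (assuming a transformers version whose generate accepts an attention mask): padding on the left keeps the EOS/pad tokens in front of the prompt, where the attention mask hides them, rather than in the middle of the text.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # pad before the prompt, not after it

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompts = ["The quick brown fox", "Once upon a time in a land far away there"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # masks out the left padding
        max_length=30,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))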
I thought you were asking about batching at training time. Sorry about the misleading answer.
Right now generate does not support batched generation for gpt2.
Pinging @lysandre
No problem @valhalla. Appreciate your response.
I have implemented this feature locally.
Off topic, but I would like to know what your thoughts are on this.
Not a TF user
Great, can you share your fix if possible? Lots of other people are interested in batched prediction for GPT.
I have implemented it in TF 2.0. I had to make quite a few changes to make it work. Will share it.
Hi, are there any updates? Thanks!