The code pads all inputs to the same length, namely the length of the longest input in the batch. However, it is important to note that during inference, the output lengths of different inputs can vary.
After testing the code with nine sentences, whose input lengths range from 32 to 1140 tokens, it was observed that model.generate(inputs_padded) completed after only one forward (decoding) pass. This suggests that the model did not perform decoding correctly.
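For reference, here is a minimal sketch of the setup described above, assuming a Hugging Face causal LM. The checkpoint name, the sentence list, and the reuse of the EOS token as padding are placeholders and assumptions for illustration, not the actual code.

```python
# Minimal sketch of the described setup (assumptions: Hugging Face causal LM,
# placeholder checkpoint name and sentences, EOS reused as the pad token).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint, not the actual model used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

# Many causal LMs define no pad token; reusing EOS here is an assumption
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder stand-ins for the nine sentences (32 to 1140 tokens long)
sentences = ["first input ...", "second, much longer input ...", "..."]

# Pad every input in the batch to the length of the longest one
batch = tokenizer(sentences, return_tensors="pt", padding=True).to("cuda")
inputs_padded = batch["input_ids"]

# First attempt: call generate with no length arguments
outputs = model.generate(inputs_padded)
```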
An additional attempt was made using model.generate(inputs_padded, max_new_tokens=64). However, this resulted in a CUDA error, probably because some sentences had finished generating while others had not.
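Continuing the sketch above, the second attempt was simply:

```python
# Second attempt, continuing the sketch above: cap the number of new tokens.
# In the actual run, this call is where the CUDA error appeared.
outputs = model.generate(inputs_padded, max_new_tokens=64)
```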
Any suggestions to solve these problems?