While running the seq2seq examples following the README, I found that training is relatively fast and uses >50% of the GPU, while evaluation (with the exact same batch size) is painfully slow with low GPU utilization. This is true for both T5 and BART. What am I doing wrong?
As a guess, evaluation requires a generation procedure, whereas training uses teacher forcing and a cross-entropy loss. This might not be the case, though, because the validation step may include some generation. I would highly recommend using the PyCharm debugger to find out which part of the code is taking so long.
Yes, this is correct: eval is slow due to generation.
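To make the difference concrete, here is a toy sketch in plain Python. It is not the example scripts' code: `ToyDecoder`, `teacher_forced_step` and `greedy_generate` are made-up names, and the "decoder" is just a counter that predicts previous token + 1. It only illustrates that a teacher-forced loss needs one decoder forward pass per batch, while generation needs one pass per generated token.

```python
class ToyDecoder:
    """Hypothetical stand-in for a seq2seq decoder; counts its forward passes."""

    def __init__(self):
        self.calls = 0

    def forward(self, decoder_input):
        # One forward pass scores every position at once (a real transformer
        # does this with causal masking). Toy rule: predict previous token + 1.
        self.calls += 1
        return [(tok + 1) % 10 for tok in decoder_input]


def teacher_forced_step(decoder, target, start_token=0):
    # Loss computation: the gold target, shifted right, is the decoder input,
    # so all positions are predicted in a single forward pass.
    decoder_input = [start_token] + target[:-1]
    predictions = decoder.forward(decoder_input)
    # A 0/1 mismatch count stands in for cross-entropy here.
    return sum(p != t for p, t in zip(predictions, target))


def greedy_generate(decoder, max_len, start_token=0):
    # Generation: each new token depends on the model's own previous output,
    # so the decoder is called once per generated token.
    generated = [start_token]
    for _ in range(max_len):
        next_token = decoder.forward(generated)[-1]
        generated.append(next_token)
    return generated[1:]


target = [1, 2, 3, 4, 5]

train_decoder = ToyDecoder()
loss = teacher_forced_step(train_decoder, target)

eval_decoder = ToyDecoder()
output = greedy_generate(eval_decoder, max_len=len(target))

print("teacher forcing:", train_decoder.calls, "forward pass(es)")
print("generation:", eval_decoder.calls, "forward pass(es)")
```

The sequential loop in `greedy_generate` cannot be parallelised across time steps, which is a plausible reason for the low GPU utilization during evaluation; beam search would add further work on top of this.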
Is it possible to turn off generation but still get validation loss and save the best checkpoint based on that? There is no need for generation if I do not want ROUGE or BLEU scores during training, right?
If you install from source, you can edit finetune.py to do as you describe
The function you want is here, and you could introduce a switch to only generate if called from
Validation loss is also calculated within this function and is the basis upon which checkpoints are ranked (as far as I understand)
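A minimal sketch of what such a switch could look like, in plain Python. This is not the actual finetune.py code: `validation_step`, `compute_loss`, `generate` and `do_generate` are hypothetical names, and the stubs below only stand in for the real model calls. It also assumes checkpoint selection is pointed at the loss rather than at a generation-based metric.

```python
def validation_step(batch, compute_loss, generate, do_generate=False):
    """Hypothetical validation step: always returns the loss, generates only on request."""
    # The loss needs a single teacher-forced forward pass, so it is cheap.
    metrics = {"val_loss": compute_loss(batch)}
    if do_generate:
        # The slow part: autoregressive decoding, only needed for ROUGE/BLEU.
        metrics["preds"] = generate(batch)
    return metrics


# Stub model calls, for illustration only.
def fake_loss(batch):
    return 0.5


def fake_generate(batch):
    return ["summary for " + text for text in batch]


batch = ["doc a", "doc b"]
fast_metrics = validation_step(batch, fake_loss, fake_generate, do_generate=False)
full_metrics = validation_step(batch, fake_loss, fake_generate, do_generate=True)
print(fast_metrics)
print(full_metrics)
```

With `do_generate=False` the step never calls `generate`, so validation costs about as much as a training forward pass.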
I think checkpoints are selected based on val_metric, which can be anything I write code for (currently ROUGE and BLEU are implemented; I have already added Levenshtein distance, but that also requires generation).
Thank you very much for answering the question.
Could you please explain a little more about what ‘generation’ means in this context? Personally, I feel like training still involves generation. In a forward pass, after the encoder passes the context vector to the decoder, the decoder still has to ‘generate’ a sequence. And during the generation of a sequence, at each step, the (cross-entropy) loss function can calculate the difference between the softmaxed distribution over all tokens and the ground-truth distribution. If the whole predicted sequence still has to be ‘generated’ in order to calculate the loss, why would the answer be that ‘generation’ is performed solely for validation, and that this is the reason it is so time-consuming? I appreciate any feedback on this. Thanks in advance!
Could you please elaborate more on how teacher forcing can speed up the training process? Thank you!