I posted a longer version of the above, with reproducible code, on Stack Overflow: "deep learning - Sequence to Sequence Loss".