According to the explanation here:

> all Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross-entropy
Can anyone please explain this? What does it mean to "fuse the last activation function, such as SoftMax, with the actual loss function, such as cross-entropy"?
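For context, here is a minimal PyTorch sketch of what I understand the quote to be describing (the tensor values are made up for illustration): `F.cross_entropy` accepts raw logits and applies log-softmax internally, so the model never needs a softmax layer at the end.

```python
import torch
import torch.nn.functional as F

# Raw model outputs (logits) for a batch of 2 examples, 3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5, 0.3]])
targets = torch.tensor([0, 1])

# "Fused" form: cross_entropy takes logits directly and applies
# log-softmax internally in a numerically stable way.
fused = F.cross_entropy(logits, targets)

# Equivalent "unfused" form: apply softmax explicitly, take the log,
# then compute the negative log-likelihood loss.
unfused = F.nll_loss(torch.log(F.softmax(logits, dim=-1)), targets)

# Both forms give the same loss value.
assert torch.allclose(fused, unfused)
```

If this is the right picture, my question is why the fused form is preferred and why that means the models themselves return logits.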