Well - this is connected to this question: BertForSequenceClassification only seems to have a linear activation at the end - is this a bug?
Why is the activation only handled inside the loss function? IMO, the different classification setups need different last-layer activations: binary classification needs a sigmoid, single-label multi-class needs a softmax, and multi-label classification needs a sigmoid again (applied per label). But the model always seems to end with a linear (i.e. no) activation. @sgugger
Isn’t this a bug?
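
To illustrate what I mean, here's a minimal sketch (assuming the standard `transformers` API and an arbitrary `num_labels=3`): the model returns raw logits, so at inference time I have to apply the softmax or sigmoid myself; during training the activation only happens implicitly inside `nn.CrossEntropyLoss` / `nn.BCEWithLogitsLoss`.

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # num_labels=3 is just for illustration
)
model.eval()

inputs = tokenizer("An example sentence.", return_tensors="pt")
with torch.no_grad():
    # shape (1, num_labels); raw logits, no activation applied by the model
    logits = model(**inputs).logits

# single-label multi-class: softmax over the label dimension
probs_multiclass = torch.softmax(logits, dim=-1)

# multi-label (or binary): sigmoid applied per label
probs_multilabel = torch.sigmoid(logits)
```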