I trained a question answering model on the SQuAD dataset. However, regardless of the model architecture I use (ELECTRA, BERT, RoBERTa, etc.), there are cases where the model predicts a start position greater than the end position, i.e. the highest ‘start logit’ falls at a later token index than the highest ‘end logit’.
Why is this the case? All the samples in the SQuAD dataset have an end position greater than the start position.
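For illustration, here is a minimal sketch of the situation, assuming the start and end positions are each taken as the argmax of their own logit vector, independently of one another. The logit values are made up, and `argmax` and `best_valid_span` are hypothetical helpers written for this example, not functions from any library.

```python
def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def best_valid_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with start <= end that has the highest
    summed logit score, considering spans of at most max_answer_len tokens."""
    best = None
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Made-up logits for a 6-token input (illustrative values only).
start_logits = [0.1, 0.2, 0.3, 2.5, 0.4, 0.1]
end_logits = [0.0, 2.8, 0.2, 0.3, 2.1, 0.1]

# Independent argmaxes: nothing ties the two choices together,
# so the start index can land after the end index.
naive_start = argmax(start_logits)
naive_end = argmax(end_logits)

# Searching only over pairs with start <= end always yields a valid span.
valid_span = best_valid_span(start_logits, end_logits)

print(naive_start, naive_end, valid_span)
```

With these values the two independent argmaxes give a start index that lies after the end index, while the joint search over ordered pairs returns a well-formed span. Whether a given training or inference pipeline does something similar at decoding time is an assumption that would need to be checked against its post-processing code.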