Why can't my simple BERT model for text classification learn anything?

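I think the problem might be where you call optimizer.zero_grad(). Gradients are computed by loss.backward(), not by the forward pass, so if zero_grad() runs after loss.backward() and before optimizer.step(), it wipes the gradients and the weights never update. Try moving that line to the start of each iteration, before the line where outputs are calculated, so the order is: optimizer.zero_grad(), forward pass, loss.backward(), optimizer.step().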
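Here is a minimal sketch of that ordering. I don't have your code, so this is an assumption-laden stand-in: a tiny nn.Linear model on random features replaces the BERT classifier, and the names train_step and broken_step are made up for illustration. The only point is where zero_grad() sits relative to backward() and step().

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the BERT classifier: a tiny linear model.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Random toy batch (assumption: 8 examples, 4 features, 2 classes).
inputs = torch.randn(8, 4)
labels = torch.randint(0, 2, (8,))


def train_step(inputs, labels):
    """One step with the recommended ordering."""
    optimizer.zero_grad()            # clear gradients left from the previous step
    outputs = model(inputs)          # forward pass
    loss = loss_fn(outputs, labels)
    loss.backward()                  # backward pass fills in the gradients
    optimizer.step()                 # update the weights with those gradients
    return loss.item()


def broken_step(inputs, labels):
    """One step with zero_grad() in the wrong place, for comparison."""
    outputs = model(inputs)
    loss = loss_fn(outputs, labels)
    loss.backward()
    optimizer.zero_grad()            # wipes the gradients that were just computed
    optimizer.step()                 # nothing left to apply, weights stay the same
    return loss.item()


losses = [train_step(inputs, labels) for _ in range(20)]
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

If you swap train_step for broken_step in the loop, the loss should stay flat, which matches the "could not learn anything" symptom.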