For tuning a classifier head on a pretrained BERT, should I use `last_hidden_state[:, -1, :]` or `outputs[0][:, 0, :]`?

So I want to take the final hidden-state embedding from a BERT model (i.e. the embedding that comes out once it's done processing the sequence) and pass that into a classifier head which I fine-tune.
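
Concretely, my setup looks roughly like this (the checkpoint, max length, and number of classes below are just placeholders, not necessarily the real ones):

from torch import nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # placeholder checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")
classifier_head_ll = nn.Linear(bert.config.hidden_size, 2)       # placeholder: 2 classes

# Pad/truncate every sequence to the same max length
batch = tokenizer(["an example sentence"], padding="max_length",
                  max_length=16, truncation=True, return_tensors="pt")
input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]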

outputs = bert(input_ids=input_ids, attention_mask=attention_mask)
# outputs[0] is the same tensor as outputs.last_hidden_state; index 0 along the sequence axis is the [CLS] position
logits = classifier_head_ll(outputs[0][:, 0, :])

The two forward-pass lines came from a tutorial, but now I'm questioning whether they're correct. I feel like `outputs.last_hidden_state[:, -1, :]` might be what I actually want, but I also have concerns because the last position comes after processing the padding. The padding should be handled by the `attention_mask`, but if that were the case, shouldn't `last_hidden_state[:, -1, :] == last_hidden_state[:, -2, :]` for rows that are padded out to the max length? That didn't look like the case when I examined the tensor. Which hidden state should I use if I want the embedding that is produced at the end of the sequence?
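
For reference, this is roughly how I examined the tensor (again, the checkpoint and max length are just examples):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

# Pad a short sentence well past its real length so the tail is all [PAD] tokens
batch = tokenizer(["a short sentence"], padding="max_length",
                  max_length=32, return_tensors="pt")

with torch.no_grad():
    hidden = bert(**batch).last_hidden_state   # shape (1, 32, hidden_size)

# If masking made the trailing pad positions identical, this would print True;
# for me it printed False, which is part of why I'm confused
print(torch.allclose(hidden[:, -1, :], hidden[:, -2, :]))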