I want to take an embedding from the last hidden state of a BERT model (i.e., the embedding produced once it has finished processing the sequence) and pass it into a classifier head that I fine-tune.
outputs = bert(input_ids=input_ids, attention_mask=attention_mask)
# outputs[0] is last_hidden_state; index 0 along the sequence dim is the [CLS] token
logits = classifier_head_ll(outputs[0][:, 0, :])
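For context, here is a minimal runnable version of what I have, assuming bert-base-uncased and a hypothetical single-linear-layer head standing in for my classifier_head_ll (num_classes is a placeholder for my actual label count):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Two sentences of different lengths so the batch gets padded
batch = tokenizer(
    ["a short sentence", "a somewhat longer sentence that forces padding"],
    padding=True,
    return_tensors="pt",
)

num_classes = 2  # placeholder
classifier_head_ll = torch.nn.Linear(bert.config.hidden_size, num_classes)

outputs = bert(
    input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
)
# last_hidden_state has shape (batch, seq_len, hidden); position 0 is [CLS]
logits = classifier_head_ll(outputs.last_hidden_state[:, 0, :])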
I got the code above from a tutorial, but now I'm questioning whether it's correct. I feel like outputs.last_hidden_state[:, -1, :] might be what I actually want. My concern is that the last position sits after the padding tokens, which the attention_mask is supposed to handle. But if the mask really neutralized padding, shouldn't last_hidden_state[:, -1, :] == last_hidden_state[:, -2, :] for rows that are padded up to the max length? That didn't look like the case when I examined the tensor. Which hidden state should I use if I want the embedding produced at the end of the sequence?