I want to take an embedding from the last hidden state of a BERT model (i.e., the embedding produced once it has finished processing the sequence) and pass it into a classifier head that I fine-tune.
outputs = bert(input_ids=input_ids, attention_mask=attention_mask)
# outputs[0] is last_hidden_state; index 0 along the sequence dim is the [CLS] token
logits = classifier_head_ll(outputs[0][:, 0, :])
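For context, here is a minimal runnable version of what I have, assuming bert-base-uncased and a hypothetical single-linear-layer head standing in for my classifier_head_ll (num_classes is a placeholder for my actual label count):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Two sentences of different lengths so the batch gets padded
batch = tokenizer(
    ["a short sentence", "a somewhat longer sentence that forces padding"],
    padding=True,
    return_tensors="pt",
)

num_classes = 2  # placeholder
classifier_head_ll = torch.nn.Linear(bert.config.hidden_size, num_classes)

outputs = bert(
    input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
)
# last_hidden_state has shape (batch, seq_len, hidden); position 0 is [CLS]
logits = classifier_head_ll(outputs.last_hidden_state[:, 0, :])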
I got the code above from a tutorial, but now I'm questioning whether it's correct. I feel like outputs.last_hidden_state[:, -1, :] might be what I actually want. My concern is that the last position sits after the padding tokens, which the attention_mask is supposed to handle. But if the mask really neutralized padding, shouldn't last_hidden_state[:, -1, :] == last_hidden_state[:, -2, :] for rows that are padded up to the max length? That didn't look like the case when I examined the tensor. Which hidden state should I use if I want the embedding produced at the end of the sequence?