If I understand correctly, the last_hidden_state should reflect the embedding vector.
Qs:
- Why is last_hidden_state different for different heads of the same checkpoint? (See the first sketch below for how I compare them.)
- Can I assume that the embedding vector is trained by the LM head and can then be used as-is by all other heads? Is that a correct approach?
- Is there a way to train only the head weights and leave the embedding weights frozen, to shorten training time? (See the second sketch below.)
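
For reference, this is how I compare the hidden states across heads (a minimal sketch; `bert-base-uncased` and the sequence-classification head are just examples):

```python
import torch
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

# Example checkpoint; the same question applies to any checkpoint/head pair.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("Hello world", return_tensors="pt")

base = AutoModel.from_pretrained(checkpoint)                           # bare encoder, no head
clf = AutoModelForSequenceClassification.from_pretrained(checkpoint)  # encoder + classification head

with torch.no_grad():
    h_base = base(**inputs).last_hidden_state
    # `clf.bert` is the base encoder submodule for BERT-style models.
    h_clf = clf.bert(**inputs).last_hidden_state

# I expected these to match, since both come from the same checkpoint.
print(torch.allclose(h_base, h_clf))
```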
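And this is what I mean by freezing the base weights and training only the head (again just a sketch; the `model.bert` attribute assumes a BERT-style model):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze every parameter of the base encoder; only the randomly
# initialized classification head remains trainable.
for param in model.bert.parameters():
    param.requires_grad = False

# Sanity check: only the head parameters should show up here.
print([name for name, p in model.named_parameters() if p.requires_grad])
```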
Thanks,
Michal.