Why isn't last_hidden_state the same for different heads of the same checkpoint?

If I understand correctly, the last_hidden_state should reflect the embedding vector.
Qs:

  1. Why is last_hidden_state different for different heads of the same checkpoint?
  2. Can I assume that the embedding vector is trained via the LM head and then used as-is for all heads? Is that a correct approach?
  3. Is there a way to train only the head weights and leave the embedding weights frozen (to shorten training time)? See the sketch after this list for roughly what I have in mind.

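For question 3, this is roughly what I was thinking of (a minimal PyTorch/Transformers sketch; the checkpoint name `bert-base-uncased` and `num_labels=2` are just placeholders). Is this the right way to freeze the base model and train only the head?

```python
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint and label count, just for illustration.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze every parameter of the base model so only the head gets updated.
for param in model.base_model.parameters():
    param.requires_grad = False

# Sanity check: only the head parameters should still require gradients.
print([name for name, p in model.named_parameters() if p.requires_grad])
```
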
Thanks,
Michal.