If I understand correctly, the last_hidden_state should reflect the embedding vector.
Qs:
- Why is last_hidden_state different for different heads of the same checkpoint? (See the first sketch below for how I compare them.)
- Can I assume that the embedding vector is trained by the LM head and can then be used as-is by all other heads? Is that a correct approach?
- Is there a way to train only the head weights and leave the embedding weights frozen, to shorten training time? (See the second sketch below.)
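
For reference, this is how I compare the hidden states across heads (a minimal sketch; `bert-base-uncased` and the sequence-classification head are just examples):

```python
import torch
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

# Example checkpoint; the same question applies to any checkpoint/head pair.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("Hello world", return_tensors="pt")

base = AutoModel.from_pretrained(checkpoint)                           # bare encoder, no head
clf = AutoModelForSequenceClassification.from_pretrained(checkpoint)  # encoder + classification head

with torch.no_grad():
    h_base = base(**inputs).last_hidden_state
    # `clf.bert` is the base encoder submodule for BERT-style models.
    h_clf = clf.bert(**inputs).last_hidden_state

# I expected these to match, since both come from the same checkpoint.
print(torch.allclose(h_base, h_clf))
```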
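And this is what I mean by freezing the base weights and training only the head (again just a sketch; the `model.bert` attribute assumes a BERT-style model):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze every parameter of the base encoder; only the randomly
# initialized classification head remains trainable.
for param in model.bert.parameters():
    param.requires_grad = False

# Sanity check: only the head parameters should show up here.
print([name for name, p in model.named_parameters() if p.requires_grad])
```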
Thanks,
Michal.