Hidden states embedding tensors

I am trying to get the key and query vectors out of the Transformer layers, but I am confused by what the docs say about the embedding tensors provided for each layer. I have output_hidden_states=True, and the docs say:

hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

Where are these embeddings coming from in each layer? At first I thought the embedding was the output of the token embedding layer, i.e. the very first input to the model, but the embeddings differ across layers, and it would not make sense to append the same tensor to the tuple for every layer. So where are these embeddings coming from?

In addition, are they at index 0 of the tuple? The docs say “output of the embeddings + one for the output”, implying index 0, but then say “model at the output of each layer plus the initial embedding outputs”, implying index 1.

Bonus question: how can I get the query vectors out? If I have the keys and the attentions, I think I can work out what the query vectors are, but is there an easier way?

Thank you,
Trenton

What it means is not “embeddings for each layer” but “output of the embeddings” + “outputs of each layer”. So for a 12-layer model, you’d get the embedding output (1 tensor) + the outputs of the 12 following layers (12 tensors) = 13 tensors in total.
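You can verify this by counting the tuple entries yourself. A minimal sketch with a 12-layer BERT (the checkpoint name and the model.embeddings attribute are BERT-specific and only meant as an example):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-uncased" has 12 encoder layers, so we expect 13 hidden states.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states
print(len(hidden_states))      # 13 = 1 embedding output + 12 layer outputs
print(hidden_states[0].shape)  # (batch_size, sequence_length, hidden_size)

# hidden_states[0] is the output of the embedding module (token + position +
# token-type embeddings, after LayerNorm), not a per-layer embedding.
with torch.no_grad():
    embedding_output = model.embeddings(
        input_ids=inputs["input_ids"], token_type_ids=inputs["token_type_ids"]
    )
print(torch.allclose(hidden_states[0], embedding_output))  # expected: True
```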

You can get the attention values back (output_attentions=True), but AFAIK not the query vectors.

Thanks for your reply Bram!

I realized that my tired brain got confused between the outputs of past_key_values and hidden_states, and that I was actually asking about the length-2 tuple returned for each layer in past_key_values. It seems like index 0 is the key and index 1 is the value, and that this has nothing to do with the embeddings?

past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and optionally, if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
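To convince myself of the shapes I printed them with a small decoder model (gpt2 here is only an example; the key-then-value ordering is what the name and the doc excerpt above suggest):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, use_cache=True)

past = outputs.past_key_values
print(len(past))       # one entry per layer (config.n_layer)

key, value = past[0]   # per the docs: index 0 = key, index 1 = value
print(key.shape)       # (batch_size, num_heads, sequence_length, embed_size_per_head)
print(value.shape)     # same shape as the key
```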

And it sounds like I will have to invert the attention values and use the keys to work out the query vectors myself.
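(An alternative to inverting the attention that I am considering: re-apply each layer's own query projection to that layer's input hidden state, since hidden_states[i] is exactly the input to layer i. The attribute path encoder.layer[i].attention.self.query below is specific to BERT-style models, so treat it as an assumption for anything else.)

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states[i] is the input to encoder layer i (hidden_states[0] is the
# embedding output, which feeds layer 0).
hidden_states = outputs.hidden_states

queries = []
with torch.no_grad():
    for i, layer in enumerate(model.encoder.layer):
        # (batch_size, sequence_length, hidden_size), before the split into heads
        q = layer.attention.self.query(hidden_states[i])
        queries.append(q)

print(len(queries), queries[0].shape)
```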

Thanks again and sorry for my confusion.

@BramVanroy so the 0th index is the output of the embedding layer? And the 0th token would be the [CLS] token if the model is BertForSequenceClassification?

I’d like to jump in as well because I can’t seem to find the answer I’m looking for anywhere else. Say that I want to extract speech embedding vectors with a Wav2Vec2 model and use them for a classification task. For that, I suppose I need the last hidden state of the model. If so, do I take the tensor at the last index or at the first index of the hidden_states tuple?

Intuitively, I think I should take the one at the last index, as that would be the output of the last layer. However, I can’t be sure because I’m confused by what “output of embeddings”/“embedding output” means. (“Output of embeddings” doesn’t sound very self-explanatory. Is it supposed to be the initial embedding of the speech input before it goes through the model layers?)

This is not always the case. For example, facebook/opt-350m is inverted: the last item in the hidden_states tuple is the output from the embedding layer.
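If in doubt, you can check any particular checkpoint by comparing both ends of the tuple against outputs.last_hidden_state. A rough sketch for the Wav2Vec2 case above (the checkpoint name is only an example):

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-base-960h"  # example checkpoint
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name, output_hidden_states=True)

# One second of dummy 16 kHz audio, just to run a forward pass.
dummy_audio = np.zeros(16000, dtype=np.float32)
inputs = feature_extractor(dummy_audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

hs = outputs.hidden_states
print(torch.allclose(outputs.last_hidden_state, hs[-1]))  # last layer at the end?
print(torch.allclose(outputs.last_hidden_state, hs[0]))   # or at the start?
```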