I am using a LLaMA-2 causal LM.
multimodal_embeddings, multimodal_attention_mask = self._build_multimodal_attention(
    input_embeddings, projected_patch_embeddings, attention_mask
)
language_model_output = self.language_model.generate(
    inputs_embeds=multimodal_embeddings, max_new_tokens=8,
    output_hidden_states=True, return_dict_in_generate=True)
I want to get the last-layer hidden state of each generated token.
But I don't know whether language_model_output.hidden_states[0][-1] is the hidden state of the first generated token, because its shape is different:
(Pdb) language_model_output.hidden_states[0][-1].shape
torch.Size([1, 535, 4096])  # why is this the same shape as multimodal_embeddings, not 1?
(Pdb) language_model_output.hidden_states[1][-1].shape
torch.Size([1, 1, 4096])
(Pdb) multimodal_embeddings.shape
torch.Size([1, 535, 4096])
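For reference, this is a minimal sketch of how I am currently trying to collect the per-token states. It assumes that language_model_output.hidden_states has one entry per generation step, that each entry is a tuple of per-layer tensors (so step_hidden[-1] is the last layer), and that every entry after the first covers exactly one new token; I am not sure whether the first entry (the one covering the full 535-position prompt) or the second entry belongs to the first generated token:

# Sketch of my current attempt, under the assumption that
# language_model_output.hidden_states[step][layer] has shape
# (batch, seq_len_of_that_step, hidden_size).
last_layer_states = []
for step_hidden in language_model_output.hidden_states:
    last_layer = step_hidden[-1]                     # last decoder layer for this step
    last_layer_states.append(last_layer[:, -1, :])   # hidden state at the final position
# Each entry is then a (1, 4096) tensor, but I am unsure whether index 0
# already corresponds to the first generated token or only to the prompt pass.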