Embeddings from Llama 2

Hello,

I am trying to get sentence embeddings from a Llama 2 model. I tried using the feature-extraction pipeline and expected the output to be a tensor of size (seq_len, embedding_dim), but it is a nested list (list of list of list).

It seems to be of size (seq_len, vocab_size). Could you please help me understand why?

Or, what is the right way to get a sentence embedding from a Llama model? Thanks!

from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
sentences = ["This is me", "A 2nd sentence"]
model_base_name = "meta-llama/Llama-2-7b-hf"
model = LlamaForCausalLM.from_pretrained(model_base_name)
tokenizer = LlamaTokenizer.from_pretrained(model_base_name)
feature_extraction = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
embeddings = feature_extraction(sentences) # output should be of size (seq_len, embedding_dim) but is of size (seq_len, vocab_size)

(Pdb) len(embeddings[0][0][0])
32000

(Pdb) len(embeddings[0][0])
4

(Pdb) len(embeddings[0])
1

(Pdb) len(tokenizer)
32000

I have the same situation as the one mentioned by Saaira.
Does anyone have a solution or an explanation for this?
Thanks!

Hi @Saaira,

I just found the why and the how for this question.

Why:
The feature-extraction pipeline simply returns the first tensor in the model's output, which for the Llama causal LM is the logits, of shape (seq_len, vocab_size). Hence the 32000 (the vocabulary size) in the last dimension.
Ref:

  1. pipeline source code
  2. Llama doc
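
To see this concretely (a minimal sketch reusing the model and tokenizer from the question; just an illustration, not a fix): calling the model directly shows that its first output is the logits over the vocabulary, which is exactly what the pipeline was returning.

inputs = tokenizer(sentences[0], return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 4, 32000]) -> (batch, seq_len, vocab_size)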

How:
Instead of using the pipeline, call the model directly and request the hidden states:

import torch
# 'hidden_states' holds one tensor per layer (embedding layer + transformer layers)
embeddings = model(torch.IntTensor([tokenizer(sentences)['input_ids'][0]]), return_dict=True, output_hidden_states=True)

This way you get the hidden states from all the layers (including the embedding layer) for each token. For the first sentence you will get:

len(embeddings['hidden_states']), embeddings['hidden_states'][0].shape
(33, torch.Size([1, 4, 4096]))
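
That is 33 tensors (the embedding layer plus the 32 transformer layers), each of shape (batch, seq_len, hidden_dim). A minimal sketch of the usual next step, assuming the `embeddings` output from above: take the last layer as the per-token embeddings.

last_hidden = embeddings['hidden_states'][-1]  # last layer's token embeddings, shape (1, 4, 4096)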

Thanks @jasperlp, very helpful!

What did you find to be the best pooling strategy with llama embeddings?

I haven’t tried much on that. I suppose it is a task-by-task thing.
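
For anyone looking for a baseline: masked mean pooling over the last hidden layer is a common starting point. A sketch only, reusing the model, tokenizer, and sentences from earlier in the thread (nothing Llama-specific is claimed about it being the best strategy):

import torch

tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
last_hidden = outputs.hidden_states[-1]        # (batch, seq_len, 4096)
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
sentence_embeddings = (last_hidden * mask).sum(1) / mask.sum(1)  # masked mean -> (batch, 4096)

Last-token pooling is another option that can suit causal LMs, since only the final position attends to the whole sentence; which works better seems to depend on the task.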

You can use AnglE-LLaMA to extract sentence embeddings from LLaMA/LLaMA2: GitHub - SeanLee97/AnglE: Angle-optimized Text Embeddings | 🔥 New SOTA


What is the max input length?
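
One way to check programmatically (assuming the model object from earlier in the thread): the maximum sequence length is stored in the model config.

print(model.config.max_position_embeddings)  # 4096 for Llama 2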