I am trying to get sentence embeddings from a Llama 2 model. I used the feature-extraction pipeline and expected the output to be a tensor of shape (seq_len, embedding_dim), but instead I get a nested list of lists of lists.
It seems to be of shape (seq_len, vocab_size)? Could you please help me understand why?
Or what is the right way to get a sentence embedding from a Llama model? Thanks!
from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
sentences = ["This is me", "A 2nd sentence"]
model_base_name = "meta-llama/Llama-2-7b-hf"
model = LlamaForCausalLM.from_pretrained(model_base_name)
tokenizer = LlamaTokenizer.from_pretrained(model_base_name)
feature_extraction = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
embeddings = feature_extraction(sentences)  # expected shape (seq_len, embedding_dim), but the last dimension is vocab-sized
(Pdb) len(embeddings[0][0][0])
32000
(Pdb) len(embeddings[0][0])
4
(Pdb) len(embeddings[0])
1
(Pdb) len(tokenizer)
32000
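For context, my working guess is that LlamaForCausalLM applies the language-model head, so the pipeline returns logits over the 32000-token vocabulary rather than hidden states. Once I do have per-token hidden states (e.g. from the base LlamaModel), the pooling step I have in mind is a masked mean over tokens. Here is a minimal sketch with dummy tensors; the mean_pool helper and the dummy shapes are my own, not from transformers:

```python
import torch

# Sketch: given per-token hidden states of shape (batch, seq_len, hidden_dim)
# and an attention mask of shape (batch, seq_len), mean-pool only the real
# (non-padding) tokens into one vector per sentence.
def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (batch, hidden_dim)
    counts = mask.sum(dim=1).clamp(min=1)         # (batch, 1), avoid div-by-zero
    return summed / counts

# Dummy tensors standing in for Llama 2 outputs (the real hidden_dim is 4096).
hidden = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])
print(mean_pool(hidden, mask).shape)  # torch.Size([2, 8])
```

The last dimension here is the hidden size, not 32000, which is what I was expecting from the pipeline in the first place.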