Using LLM word embeddings within context

Hello,
I would like to create a vector database for querying words/concepts. To build this database I need embeddings that capture the meaning of a word within a sentence. The idea is, for instance, to allow up to 10 different "meanings" (i.e. distant vectors) for each word/concept.
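A minimal sketch of that idea (KMeans, the number 10, and the dummy vectors are only placeholders, not a recommendation):

import numpy as np
from sklearn.cluster import KMeans

# Contextual embeddings collected for one word across many sentences
# (dummy data here; in practice these would come from the model below)
vectors_for_word = np.random.rand(200, 768)

# Group the occurrences into up to 10 "meaning" clusters
kmeans = KMeans(n_clusters=10, n_init=10).fit(vectors_for_word)
meaning_ids = kmeans.labels_               # one "meaning" id per occurrence
meaning_vectors = kmeans.cluster_centers_  # up to 10 vectors per word to store in the database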
Are models like BLOOM or LLaMA suited for this? I tried the following basic code, but it wasn't what I expected.

from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-1b7")
model = AutoModel.from_pretrained("bigscience/bloomz-1b7")
sent1 = "I deposited my paycheck at the bank."
sent2 = "The children sat on the bank of the river."
sent3 = "I went to the bank."
tokens = tokenizer(sent1, add_special_tokens=False, return_tensors="pt")
tokens2 = tokenizer(sent2, add_special_tokens=False, return_tensors="pt")
tokens3 = tokenizer(sent3, add_special_tokens=False, return_tensors="pt")
bank1 = model(**tokens).last_hidden_state[:, 6, :]   # token position intended to correspond to "bank" in sent1
bank2 = model(**tokens2).last_hidden_state[:, 8, :]  # token position intended to correspond to "bank" in sent2
bank3 = model(**tokens3).last_hidden_state[:, 4, :]  # token position intended to correspond to "bank" in sent3
children = model(**tokens2).last_hidden_state[:, 2, :]  # token position intended to correspond to "children" in sent2

print(cosine_similarity(bank1.detach().numpy(), bank2.detach().numpy())) # [[0.9282925]]
print(cosine_similarity(bank2.detach().numpy(), bank3.detach().numpy())) # [[0.9136802]]
print(cosine_similarity(bank1.detach().numpy(), bank3.detach().numpy())) # [[0.89588195]]
print(cosine_similarity(bank1.detach().numpy(), children.detach().numpy())) # [[0.91757184]]
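For reference, the hard-coded indices are meant to pick out the "bank"/"children" tokens; something like this can be used to double-check where each word lands after tokenization (with the same bloomz tokenizer as above):

# Sanity check: print the token at each position to confirm which index holds "bank"
for s in (sent1, sent2, sent3):
    print(list(enumerate(tokenizer.tokenize(s))))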

I expected the cosine similarities to be lower, especially for the (children, bank) pair. Am I heading in the wrong direction?
Thank you in advance!

Hi Veeko, I'm a newbie, but could it be the model?

Here is your approach with a different model:

from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

sent1 = "I deposited my paycheck at the bank."
sent2 = "The children sat on the bank of the river."
sent3 = "I went to the bank."
sentences = [sent1, sent2, sent3]

# all-MiniLM-L6-v2 is a sentence-embedding model; encode() returns one vector per sentence
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)
embeddings = model.encode(sentences)
print(embeddings)

print(cosine_similarity(embeddings[[0]], embeddings[[1]]))  # [[0.20213181]]
print(cosine_similarity(embeddings[[1]], embeddings[[2]]))  # [[0.2704972]]
print(cosine_similarity(embeddings[[0]], embeddings[[2]]))  # [[0.6851812]]


Please take a look at Computing Sentence Embeddings — Sentence-Transformers documentation.
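
If you need a vector for a single word in context rather than for the whole sentence, one option (just a sketch; bert-base-uncased is an arbitrary example encoder, and span-averaging is only one possible pooling choice) is to average the hidden states of the sub-word tokens that cover the word:

from transformers import AutoTokenizer, AutoModel
import torch

# bert-base-uncased is just an example; any encoder with a fast tokenizer works
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    # Keep the character offsets so the word can be mapped back to its sub-word tokens
    inputs = tok(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = inputs.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state[0]
    start = sentence.lower().index(word)
    end = start + len(word)
    # Average the hidden states of every token whose character span overlaps the word
    idx = [i for i, (s, e) in enumerate(offsets) if s < end and e > start and e > s]
    return hidden[idx].mean(dim=0)

bank_money = word_vector("I deposited my paycheck at the bank.", "bank")
bank_river = word_vector("The children sat on the bank of the river.", "bank")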