Hello,
I would like to build a vector database for querying words/concepts. To populate it, I need embeddings that capture the meaning of a word within a sentence. The idea is, for instance, to allow up to 10 different "meanings" (i.e., distant vectors) for each word/concept.
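To make the "10 meanings" part concrete, here is roughly the indexing step I have in mind, assuming I can get one contextual vector per occurrence of a word. The clustering method, the vector dimension, and the random vectors below are just placeholders standing in for real contextual embeddings:
import numpy as np
from sklearn.cluster import KMeans

def build_sense_vectors(occurrence_vectors, max_senses=10):
    # Cluster all contextual vectors collected for one word into at most
    # `max_senses` centroids; each centroid would be one stored "meaning".
    n_clusters = min(max_senses, len(occurrence_vectors))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(occurrence_vectors)
    return kmeans.cluster_centers_

# Toy usage: 200 fake occurrences of "bank", 768-dim vectors standing in for real ones.
rng = np.random.default_rng(0)
fake_occurrences = rng.normal(size=(200, 768))
print(build_sense_vectors(fake_occurrences).shape)  # (10, 768)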
Are models like BLOOM or LLaMA suited for this? I tried the following basic code, but it wasn't what I expected.
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-1b7")
model = AutoModel.from_pretrained("bigscience/bloomz-1b7")

sent1 = "I deposited my paycheck at the bank."
sent2 = "The children sat on the bank of the river."
sent3 = "I went to the bank."

tokens = tokenizer(sent1, add_special_tokens=False, return_tensors="pt")
tokens2 = tokenizer(sent2, add_special_tokens=False, return_tensors="pt")
tokens3 = tokenizer(sent3, add_special_tokens=False, return_tensors="pt")

# Last hidden state at the token position where I expect "bank" in each sentence
# (indices hand-picked per sentence), plus "children" as a control.
bank1 = model(**tokens).last_hidden_state[:, 6, :]
bank2 = model(**tokens2).last_hidden_state[:, 8, :]
bank3 = model(**tokens3).last_hidden_state[:, 4, :]
children = model(**tokens2).last_hidden_state[:, 2, :]

print(cosine_similarity(bank1.detach().numpy(), bank2.detach().numpy()))    # [[0.9282925]]
print(cosine_similarity(bank2.detach().numpy(), bank3.detach().numpy()))    # [[0.9136802]]
print(cosine_similarity(bank1.detach().numpy(), bank3.detach().numpy()))    # [[0.89588195]]
print(cosine_similarity(bank1.detach().numpy(), children.detach().numpy())) # [[0.91757184]]
I expected the cosine similarities to be lower, especially for the (children, bank) pair. Am I heading in the wrong direction?
Thank you in advance!