Add_faiss_index usage example

Hi, I am trying to learn how to use RAG/DPR, but first I want to get familiar with FAISS usage.
I checked the official example in

But the snippet there does not seem to be self-executable.
So I made some modifications, aiming to retrieve examples similar to the query 'I am happy' from the sst2 dataset.

import datasets
from transformers import pipeline
embed = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")
ds = datasets.load_dataset('glue', 'sst2', split='test')
ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['sentence'])})
ds_with_embeddings.add_faiss_index(column='embeddings')
# query
scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('I am happy.'), k=10)
# save index
ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')

ds = datasets.load_dataset('glue', 'sst2', split='test')
# load index
ds.load_faiss_index('embeddings', 'my_index.faiss')
# query
scores, retrieved_examples = ds.get_nearest_examples('embeddings', embed('I am happy.'), k=10)

My problem is at ds_with_embeddings.add_faiss_index(column='embeddings'),
where I got the error "TypeError: float() argument must be a string or a number, not 'dict'".
If I change it to

ds_with_embeddings_score = ds_with_embeddings.map(lambda example: {'embeddings_score': example['embeddings'][0]['score']})
ds_with_embeddings_score.add_faiss_index(column='embeddings_score')

then I got "TypeError: len() of unsized object".
Any advice? Thanks.

I have little experience with pipelines, but I think the issue is that embed(example['sentence']) should return a vector representation of example['sentence']. However, calling a text-classification pipeline returns a dict with labels and scores. Instead, you need to run a feature-extraction pipeline, which returns vectors. (You may need to unpack the result, though, as the return type is a nested list, presumably to support batched processing.)
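The dict-valued column can be reproduced without running any model: add_faiss_index needs to build a float32 array from the column, and that conversion fails on dict entries. A minimal sketch (the label and score values are made up to mimic what the sentiment pipeline stores per example):

```python
import numpy as np

# What the sentiment pipeline stores for one example: a list with one dict.
row = [{"label": "5 stars", "score": 0.71}]

# add_faiss_index tries to build a float32 array from the column entries,
# which fails because dicts cannot be converted to floats:
try:
    np.array(row, dtype=np.float32)
except TypeError as e:
    print(e)  # the same "float() argument must be a string or a number, not 'dict'" family of error
```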

So (untested) you can try something like:

embed = pipeline('feature-extraction', model="nlptown/bert-base-multilingual-uncased-sentiment")
...
ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['sentence'])[0]})

You may need to experiment a bit with the [0]; I am not sure whether it is necessary.

It seems the embedding needs to have a fixed size: if I use the pipeline with 'feature-extraction', I have to apply another pooling step myself.
I used a third-party sentence-embedding extractor instead, and it seems to work.
My executable code is:
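The extra pooling step can be sketched with plain numpy: a feature-extraction pipeline gives one vector per token, so sentences of different lengths must be collapsed to a single fixed-size vector before indexing (the feature values and the 4-dim hidden size below are made up):

```python
import numpy as np

def max_pool(token_features):
    # Collapse a (num_tokens, hidden_size) matrix to a (hidden_size,) vector
    # by taking the element-wise maximum over tokens.
    return np.asarray(token_features, dtype=np.float32).max(axis=0)

# Two sentences of different lengths, one row per token:
short = [[0.1, 0.2, 0.3, 0.4],
         [0.5, 0.1, 0.0, 0.2]]
longer = [[0.3, 0.1, 0.2, 0.0],
          [0.0, 0.4, 0.1, 0.3],
          [0.2, 0.2, 0.5, 0.1]]

# Both pool to the same fixed size, so they can share one FAISS index.
print(max_pool(short))   # [0.5 0.2 0.3 0.4]
print(max_pool(longer))  # [0.3 0.4 0.5 0.3]
```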

import datasets
from sentence_transformers import SentenceTransformer

embed = SentenceTransformer("nli-bert-large-max-pooling")
ds = datasets.load_dataset('glue', 'sst2', split='test')
ds_with_embeddings = ds.map(lambda example: {'embeddings': embed.encode(example['sentence'])})
ds_with_embeddings.add_faiss_index(column='embeddings')
# query
scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed.encode('I am happy.'), k=10)
# save index
ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')

ds = datasets.load_dataset('glue', 'sst2', split='test')
# load index
ds.load_faiss_index('embeddings', 'my_index.faiss')
# query
scores, retrieved_examples = ds.get_nearest_examples('embeddings', embed.encode('I am happy.'), k=10)
print("\n scores: \n{}".format(scores))
print("\n retrieved_examples: \n{}\n".format(retrieved_examples))

with the output:

 scores: 
[ 93.086624 115.375305 135.21281  137.25983  146.3393   152.40036
 165.03342  165.48126  166.6388   181.6477  ]

 retrieved_examples: 
OrderedDict([('idx', [138, 874, 1251, 157, 144, 917, 21, 169, 1502, 1392]), ('label', [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]), ('sentence', ['i loved it !', 'the result is something quite fresh and delightful .', 'remarkably accessible and affecting .', 'i admired this work a lot .', 'witty , touching and well paced .', '... in this incarnation its fizz is infectious .', 'a feel-good picture in the best sense of the term .', "brilliant ! '", "is n't it great ?", 'go see it and enjoy .'])])

If anyone could provide an HF-native code snippet, I would appreciate it.

You may use add_faiss_index_from_external_arrays instead.