[Semantic search with FAISS] Can't manage to format embeddings column to numpy format

Matthieu · December 8, 2021, 3:43pm

Hello,

I would like to try semantic search with FAISS, following the HF course, but on a Wikipedia dataset.

I use a local Docker container running a sentence-transformers model to compute an embedding for each paragraph, with the following function:

def get_embeddings(text_list):
    payload_resp = requests.request("POST", api_sent_embed_url, data=json.dumps(text_list))
    return np.array(json.loads(payload_resp.content.decode("utf-8")), dtype=np.float32)

With this function I get a NumPy array as output on this test data:

payload1 = ["Navigateur Web : Ce logiciel permet d'accéder à des pages web depuis votre ordinateur. Il en existe plusieurs téléchargeables gratuitement comme Google Chrome ou Mozilla. Certains sont même déjà installés comme Safari sur Mac OS et Edge sur Microsoft."]
payload2 = ["Google Chrome. Mozilla. Safari."]
payload = payload1 + payload2

embedding = get_embeddings(payload)
embedding.shape

(2, 384)

However, when I apply it to my embeddings_dataset

Dataset({
    features: ['id', 'revid', 'text', 'title', 'url'],
    num_rows: 100
})

the “embeddings” column still comes back as a list rather than a NumPy array:

embeddings_dataset = embeddings_dataset.map(lambda x: {"embeddings": get_embeddings(x["text"])})
type(embeddings_dataset[0]["embeddings"])

list

Does anyone have advice on this problem?

mariosasko · December 8, 2021, 4:33pm

Hi,

Set the format of the embeddings column to NumPy after the map call to get a NumPy array:

embeddings_dataset.set_format("numpy", columns=["embeddings"], output_all_columns=True)
type(embeddings_dataset[0]["embeddings"])
