[Semantic search with FAISS] Can't manage to format embeddings column to numpy format

Hello,

I would like to try semantic search with FAISS, following the HF course, but on a Wikipedia dataset.

I use a local Docker container running a sentence-transformers model to compute embeddings for each paragraph, with the following function:

import json
import numpy as np
import requests

def get_embeddings(text_list):
    payload_resp = requests.request("POST", api_sent_embed_url, data=json.dumps(text_list))
    return np.array(json.loads(payload_resp.content.decode("utf-8")), dtype=np.float32)

With this function I get a NumPy array as output on this test data:

payload1 = ["Navigateur Web : Ce logiciel permet d'accéder à des pages web depuis votre ordinateur. Il en existe plusieurs téléchargeables gratuitement comme Google Chrome ou Mozilla. Certains sont même déjà installés comme Safari sur Mac OS et Edge sur Microsoft."]
payload2 = ["Google Chrome. Mozilla. Safari."]
payload = payload1 + payload2

embedding = get_embeddings(payload)
embedding.shape

(2, 384)

However, when I try to apply it to my embeddings_dataset:

Dataset({
    features: ['id', 'revid', 'text', 'title', 'url'],
    num_rows: 100
})

the "embeddings" column still comes back as a Python list rather than a NumPy array:

embeddings_dataset = embeddings_dataset.map(lambda x: {"embeddings": get_embeddings(x["text"])})
type(embeddings_dataset[0]["embeddings"])

list

Does anyone have advice on this problem?

Hi,

Set the format of the embeddings column to NumPy after the map call to get a NumPy array:

embeddings_dataset.set_format("numpy", columns=["embeddings"], output_all_columns=True)
type(embeddings_dataset[0]["embeddings"])