[Semantic search with FAISS] Can't manage to format embeddings column to numpy format


I would like to test semantic search with FAISS, following the HF course, but on a Wikipedia dataset.

I use a local Docker container running a sentence-transformers model to compute embeddings for each paragraph, with the following function:

import json
import numpy as np
import requests

def get_embeddings(text_list):
    # POST the texts to the local embedding endpoint and decode the
    # JSON response into a float32 NumPy array
    payload_resp = requests.post(api_sent_embed_url, data=json.dumps(text_list))
    return np.array(json.loads(payload_resp.content.decode("utf-8")), dtype=np.float32)

With this function I do get a NumPy array as output on this test data:

payload1 = ["Navigateur Web : Ce logiciel permet d'accéder à des pages web depuis votre ordinateur. Il en existe plusieurs téléchargeables gratuitement comme Google Chrome ou Mozilla. Certains sont même déjà installés comme Safari sur Mac OS et Edge sur Microsoft."]
payload2 = ["Google Chrome. Mozilla. Safari."]
payload = payload1 + payload2

embedding = get_embeddings(payload)
print(embedding.shape)  # (2, 384)

However, when I apply this to my dataset:

Dataset({
    features: ['id', 'revid', 'text', 'title', 'url'],
    num_rows: 100
})

the “embeddings” column still comes back as a Python list, not a NumPy array:

embeddings_dataset = embeddings_dataset.map(lambda x: {"embeddings": get_embeddings(x["text"])})


Would anyone have any advice on this problem?


This is expected: Dataset.map stores its outputs in Arrow, so arrays are converted back to plain Python lists. Set the format of the “embeddings” column to NumPy after the map call to get a NumPy array back:

embeddings_dataset.set_format("numpy", columns=["embeddings"], output_all_columns=True)