Hello everyone,
I am adding a FAISS index to the MS MARCO passages dataset (~8.8M passages). I have already created the embeddings with DPRContextEncoderTokenizer and DPRContextEncoder.
Dataset info: (screenshot of the dataset structure omitted)
After adding the FAISS index, I tried to retrieve some documents with a query:
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question = 'what is the difference between a c-corp and a s-corp?'
question_embedding = q_encoder(**q_tokenizer(question, return_tensors="pt"))[0][0].detach().numpy()
scores, retrieved_documents = dataset_embedded_passages['train'].get_nearest_examples('embeddings', question_embedding, k=10)
This raised the following error:
ArrowInvalid Traceback (most recent call last)
in <module>
~/.conda/envs/andregodinho/lib/python3.6/site-packages/datasets/search.py in get_nearest_examples(self, index_name, query, k)
564 self._check_index_is_initialized(index_name)
565 scores, indices = self.search(index_name, query, k)
--> 566 return NearestExamplesResults(scores, self[[i for i in indices if i >= 0]])
567
568 def get_nearest_examples_batch(
~/.conda/envs/andregodinho/lib/python3.6/site-packages/datasets/arrow_dataset.py in __getitem__(self, key)
1069 format_columns=self._format_columns,
1070 output_all_columns=self._output_all_columns,
-> 1071 format_kwargs=self._format_kwargs,
1072 )
1073
~/.conda/envs/andregodinho/lib/python3.6/site-packages/datasets/arrow_dataset.py in _getitem(self, key, format_type, format_columns, output_all_columns, format_kwargs)
1037 )
1038 else:
-> 1039 data_subset = self._data.take(indices_array)
1040
1041 if format_type is not None:
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.take()
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/compute.py in take(data, indices, boundscheck)
266 """
267 options = TakeOptions(boundscheck)
--> 268 return call_function('take', [data, indices], options)
269
270
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/_compute.pyx in pyarrow._compute.call_function()
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/_compute.pyx in pyarrow._compute.Function.call()
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
So the call fails with ArrowInvalid: offset overflow while concatenating arrays.
The dataset is very large, so I suspect its size is the cause. Is there a workaround?
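For context on why size matters here: pyarrow's plain string arrays use 32-bit offsets, so concatenating chunks that together exceed ~2 GB of text overflows them, and the bulk Table.take over all k retrieved rows (visible in the traceback above) is exactly the kind of operation that can trigger this on a dataset of ~8.8M passages. One workaround I would try (a sketch, not a tested fix) is to call search() to get the raw indices and then fetch the rows one at a time, so no large multi-row concatenation ever happens. Below, a plain Python list stands in for the Arrow-backed dataset; the commented-out lines show how the same pattern would map onto the snippet above:

```python
# Sketch of the workaround, with a plain Python list standing in for the
# (very large) Arrow-backed dataset. On the real dataset this would read:
#
#   scores, indices = dataset_embedded_passages['train'].search(
#       'embeddings', question_embedding, k=10)
#   retrieved = [dataset_embedded_passages['train'][int(i)]
#                for i in indices if i >= 0]
#
# i.e. search() instead of get_nearest_examples(), plus row-at-a-time access.
passages = [f"passage {i}" for i in range(1000)]
indices = [5, 42, 7, -1, 999]  # FAISS pads missing neighbours with -1

# Fetch each hit individually instead of one bulk take over all k rows,
# skipping the -1 padding the same way get_nearest_examples does internally.
retrieved = [passages[i] for i in indices if i >= 0]
print(retrieved)  # ['passage 5', 'passage 42', 'passage 7', 'passage 999']
```

Another option people report for this error is casting the text column to Arrow's large_string type, which uses 64-bit offsets, though I have not verified that on a dataset of this size.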