.get_nearest_examples() throws ArrowInvalid: offset overflow while concatenating arrays

Hello everyone,

I am adding a FAISS index to the MS MARCO passages dataset, which has ~8.8M passages. I have already created the embeddings with DPRContextEncoderTokenizer and DPRContextEncoder.

Dataset info:

After adding the FAISS index to it, I tried to retrieve some documents with a query.
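For context, the nearest-examples lookup is an inner-product search over the embeddings column. Conceptually (ignoring FAISS's acceleration and indexing structures) it amounts to this brute-force numpy sketch — illustrative only, with tiny random vectors standing in for the real ~8.8M x 768 matrix:

```python
import numpy as np

def brute_force_search(embeddings, query, k=10):
    # DPR scores passages by inner product with the question embedding;
    # FAISS just computes this much faster over millions of vectors.
    scores = embeddings @ query           # one score per passage
    top = np.argsort(-scores)[:k]         # indices of the k best scores
    return scores[top], top

# Tiny illustrative data in place of the real passage embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 8)).astype("float32")
query_vec = rng.normal(size=8).astype("float32")
scores, indices = brute_force_search(emb, query_vec, k=10)
```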

q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

question = 'what is the difference between a c-corp and a s-corp?'
question_embedding = q_encoder(**q_tokenizer(question, return_tensors="pt"))[0][0].detach().numpy()

scores, retrieved_documents = dataset_embedded_passages['train'].get_nearest_examples('embeddings', question_embedding, k=10)

And then it threw:

ArrowInvalid                              Traceback (most recent call last)

in

~/.conda/envs/andregodinho/lib/python3.6/site-packages/datasets/search.py in get_nearest_examples(self, index_name, query, k)
    564         self._check_index_is_initialized(index_name)
    565         scores, indices = self.search(index_name, query, k)
--> 566         return NearestExamplesResults(scores, self[[i for i in indices if i >= 0]])
    567 
    568     def get_nearest_examples_batch(

~/.conda/envs/andregodinho/lib/python3.6/site-packages/datasets/arrow_dataset.py in __getitem__(self, key)
   1069             format_columns=self._format_columns,
   1070             output_all_columns=self._output_all_columns,
-> 1071             format_kwargs=self._format_kwargs,
   1072         )
   1073 

~/.conda/envs/andregodinho/lib/python3.6/site-packages/datasets/arrow_dataset.py in _getitem(self, key, format_type, format_columns, output_all_columns, format_kwargs)
   1037                 )
   1038             else:
-> 1039                 data_subset = self._data.take(indices_array)
   1040 
   1041             if format_type is not None:

~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.take()

~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/compute.py in take(data, indices, boundscheck)
    266     """
    267     options = TakeOptions(boundscheck)
--> 268     return call_function('take', [data, indices], options)
    269 
    270 

~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/_compute.pyx in pyarrow._compute.call_function()

~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/_compute.pyx in pyarrow._compute.Function.call()

~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

It threw ArrowInvalid: offset overflow while concatenating arrays

The dataset is very large. Any workaround?

Thanks for reporting!

It will be fixed in this week's datasets release.


Thank you Quentin. Please let me know once you have fixed it.

Actually, I think it's already included in datasets==1.0.2.
Could you update the lib and let me know if it fixes your issue?

pip install --upgrade datasets

Problem solved.

Cheers!
