Hello everyone,
I am adding a FAISS index to the MS MARCO passages dataset (~8.8M passages). I have already created the embeddings with DPRContextEncoderTokenizer and DPRContextEncoder.
Dataset info: (screenshot of the dataset structure omitted)
After adding the FAISS index, I tried to retrieve some documents with a query:
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question = 'what is the difference between a c-corp and a s-corp?'
question_embedding = q_encoder(**q_tokenizer(question, return_tensors="pt"))[0][0].detach().numpy()
scores, retrieved_documents = dataset_embedded_passages['train'].get_nearest_examples('embeddings', question_embedding, k=10)
This raised the following error:
ArrowInvalid Traceback (most recent call last)
in <module>
~/.conda/envs/andregodinho/lib/python3.6/site-packages/datasets/search.py in get_nearest_examples(self, index_name, query, k)
564 self._check_index_is_initialized(index_name)
565 scores, indices = self.search(index_name, query, k)
--> 566 return NearestExamplesResults(scores, self[[i for i in indices if i >= 0]])
567
568 def get_nearest_examples_batch(
~/.conda/envs/andregodinho/lib/python3.6/site-packages/datasets/arrow_dataset.py in __getitem__(self, key)
1069 format_columns=self._format_columns,
1070 output_all_columns=self._output_all_columns,
-> 1071 format_kwargs=self._format_kwargs,
1072 )
1073
~/.conda/envs/andregodinho/lib/python3.6/site-packages/datasets/arrow_dataset.py in _getitem(self, key, format_type, format_columns, output_all_columns, format_kwargs)
1037 )
1038 else:
-> 1039 data_subset = self._data.take(indices_array)
1040
1041 if format_type is not None:
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.take()
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/compute.py in take(data, indices, boundscheck)
266 """
267 options = TakeOptions(boundscheck)
--> 268 return call_function('take', [data, indices], options)
269
270
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/_compute.pyx in pyarrow._compute.call_function()
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/_compute.pyx in pyarrow._compute.Function.call()
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/.conda/envs/andregodinho/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
So the call fails with ArrowInvalid: offset overflow while concatenating arrays.
The dataset is very large, so I suspect its size is the cause. Is there a workaround?
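For context on why size matters here: pyarrow's plain string arrays use 32-bit offsets, so concatenating chunks that together exceed ~2 GB of text overflows them, and the bulk Table.take over all k retrieved rows (visible in the traceback above) is exactly the kind of operation that can trigger this on a dataset of ~8.8M passages. One workaround I would try (a sketch, not a tested fix) is to call search() to get the raw indices and then fetch the rows one at a time, so no large multi-row concatenation ever happens. Below, a plain Python list stands in for the Arrow-backed dataset; the commented-out lines show how the same pattern would map onto the snippet above:

```python
# Sketch of the workaround, with a plain Python list standing in for the
# (very large) Arrow-backed dataset. On the real dataset this would read:
#
#   scores, indices = dataset_embedded_passages['train'].search(
#       'embeddings', question_embedding, k=10)
#   retrieved = [dataset_embedded_passages['train'][int(i)]
#                for i in indices if i >= 0]
#
# i.e. search() instead of get_nearest_examples(), plus row-at-a-time access.
passages = [f"passage {i}" for i in range(1000)]
indices = [5, 42, 7, -1, 999]  # FAISS pads missing neighbours with -1

# Fetch each hit individually instead of one bulk take over all k rows,
# skipping the -1 padding the same way get_nearest_examples does internally.
retrieved = [passages[i] for i in indices if i >= 0]
print(retrieved)  # ['passage 5', 'passage 42', 'passage 7', 'passage 999']
```

Another option people report for this error is casting the text column to Arrow's large_string type, which uses 64-bit offsets, though I have not verified that on a dataset of this size.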