Loading natural_questions

Arij · January 20, 2023, 5:59pm

Using just

from datasets import load_dataset

dataset = load_dataset("natural_questions")

gives me the following error:

File ~/anaconda3/lib/python3.9/site-packages/datasets/builder.py:1879, in BeamBasedBuilder._download_and_prepare(self, dl_manager, verify_infos, **prepare_splits_kwargs)
1877 if not beam_runner and not beam_options:
1878 usage_example = f"load_dataset('{self.name}', '{self.config.name}', beam_runner='DirectRunner')"
-> 1879 raise MissingBeamOptions(
1880 "Trying to generate a dataset using Apache Beam, yet no Beam Runner "
1881 "or PipelineOptions() has been provided in load_dataset or in the "
1882 "builder arguments. For big datasets it has to run on large-scale data "
1883 "processing tools like Dataflow, Spark, etc. More information about "
1884 "Apache Beam runners at "
1885 "Apache Beam Capability Matrix"
1886 "\nIf you really want to run it locally because you feel like the "
1887 "Dataset is small enough, you can use the local beam runner called "
1888 "DirectRunner (you may run out of memory). \nExample of usage: "
1889 f"\n\t{usage_example}"
1890 )
1892 # Beam type checking assumes transforms multiple outputs are of same type,
1893 # which is not our case. Plus it doesn't handle correctly all types, so we
1894 # are better without it.
1895 pipeline_options = {"pipeline_type_check": False}

MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at Apache Beam Capability Matrix
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage:
load_dataset('natural_questions', 'default', beam_runner='DirectRunner')

Using load_dataset('natural_questions', 'default', beam_runner='DirectRunner') gives this error:

load_dataset('natural_questions', 'default', beam_runner='DirectRunner')
Downloading and preparing dataset natural_questions/default to /.cache/huggingface/datasets/natural_questions/default/0.0.4/da8124c83e3394df62c0f9bbc6c07652bbe9288ad833053134d5f0e978bb4ee5…
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17.4k/17.4k [00:00<00:00, 1.03MB/s]
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 182.84it/s]

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 79.02parquet files/s]
0%| | 0/1 [03:38<?, ?shards/s]
Traceback (most recent call last):
File "", line 1, in
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/load.py", line 1741, in load_dataset
builder_instance.download_and_prepare(
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/builder.py", line 822, in download_and_prepare
self._download_and_prepare(
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/builder.py", line 1920, in _download_and_prepare
num_examples, num_bytes = beam_writer.finalize(metrics.query(m_filter))
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/arrow_writer.py", line 676, in finalize
shard_num_bytes, _ = parquet_to_arrow(source, destination)
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/arrow_writer.py", line 719, in parquet_to_arrow
for record_batch in parquet_file.iter_batches():
File "pyarrow/_parquet.pyx", line 1323, in iter_batches
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs

lhoestq · January 23, 2023, 1:27pm

Hi! What version of datasets are you using? Can you try updating? We fixed a bug like this one in datasets 2.8.0.

Arij · January 23, 2023, 1:57pm

Actually, the last error came from a new conda environment that I created specifically to solve this issue. In this environment I had installed only datasets, using conda install datasets.

This is the result again:

dataset = load_dataset("natural_questions")
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 281459.87it/s]
Using custom data configuration natural_questions-4f5f22b23f27c846
Downloading and preparing dataset json/natural_questions to /home/arij/.cache/huggingface/datasets/json/natural_questions-4f5f22b23f27c846/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab…
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2597.09it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [12:37<00:00, 378.77s/it]
Traceback (most recent call last):
File "", line 1, in
File "anaconda3/envs/lib/python3.10/site-packages/datasets/load.py", line 1742, in load_dataset
builder_instance.download_and_prepare(
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 814, in download_and_prepare
self._download_and_prepare(
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 905, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 1520, in _prepare_split
writer.write_table(table)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/arrow_writer.py", line 540, in write_table
pa_table = table_cast(pa_table, self._schema)
File "anaconda3/envslib/python3.10/site-packages/datasets/table.py", line 2068, in table_cast
return cast_table_to_schema(table, schema)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 2030, in cast_table_to_schema
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 2030, in
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1740, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1740, in
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1867, in cast_array_to_feature
casted_values = _c(array.values, feature[0])
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1742, in wrapper
return func(array, *args, **kwargs)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1862, in cast_array_to_feature
arrays = [_c(array.field(name), subfeature) for name, subfeature in feature.items()]
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1862, in
arrays = [_c(array.field(name), subfeature) for name, subfeature in feature.items()]
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1742, in wrapper
return func(array, *args, **kwargs)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1912, in cast_array_to_feature
return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1742, in wrapper
return func(array, *args, **kwargs)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1821, in array_cast
return array.cast(pa_type)
File "pyarrow/array.pxi", line 915, in pyarrow.lib.Array.cast
File "anaconda3/envs/lib/python3.10/site-packages/pyarrow/compute.py", line 376, in cast
return call_function("cast", [arr], options)
File "pyarrow/_compute.pyx", line 542, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 341, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 3868261029123585616 not in range: -9007199254740992 to 9007199254740992

conda install datasets
conda list | grep datasets
datasets 2.6.1 pypi_0 pypi

pip install datasets
pip list | grep datasets
datasets 2.6.1

As a final try, I uninstalled datasets with both conda and pip, then reinstalled it using pip. The latest version of datasets installed successfully, but loading the NQ dataset failed again with this error:

File "pyarrow/array.pxi", line 915, in pyarrow.lib.Array.cast
File "/home/arij/anaconda3/envs/QA/lib/python3.10/site-packages/pyarrow/compute.py", line 376, in cast
return call_function("cast", [arr], options)
File "pyarrow/_compute.pyx", line 542, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 341, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 3868261029123585616 not in range: -9007199254740992 to 9007199254740992

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "", line 1, in
File "anaconda3/envs/lib/python3.10/site-packages/datasets/load.py", line 1757, in load_dataset
builder_instance.download_and_prepare(
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 860, in download_and_prepare
self._download_and_prepare(
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 953, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 1706, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 1849, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Anyway, yesterday I downloaded the dataset from the official site. I hope this issue will be fixed soon for future users.
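
A note on the range quoted in the ArrowInvalid message: the bounds are exactly plus and minus 2**53, the limit beyond which a 64-bit float can no longer represent every integer exactly. The sketch below only illustrates that arithmetic in plain Python. It assumes (this is not confirmed anywhere in the thread) that pyarrow raised the error while casting an integer column to a 64-bit float; `fits_exactly_in_float64` is a hypothetical helper written for this illustration and is not part of datasets or pyarrow.

```python
# The bounds printed in the ArrowInvalid message are +/- 2**53.
FLOAT64_EXACT_INT_LIMIT = 2 ** 53


def fits_exactly_in_float64(value: int) -> bool:
    """Return True if `value` lies inside the range quoted in the error message."""
    return -FLOAT64_EXACT_INT_LIMIT <= value <= FLOAT64_EXACT_INT_LIMIT


# Integer copied from the traceback in this post.
reported_value = 3868261029123585616

print(FLOAT64_EXACT_INT_LIMIT)                       # 9007199254740992
print(fits_exactly_in_float64(reported_value))       # False
# Converting to float and back changes the value, which is the kind of
# silent precision loss a checked cast refuses to perform.
print(int(float(reported_value)) == reported_value)  # False
```

If that assumption holds, the failing column contains large integer IDs that would lose precision as floats, so the cast is rejected rather than performed lossily.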

lhoestq · January 26, 2023, 10:00am

Can you try installing datasets using the official Hugging Face conda channel?

conda install -c huggingface datasets

In your env you seem to have datasets 2.6.1, while this was fixed in 2.8.0.
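
To confirm which datasets version the active environment actually resolves (conda and pip can disagree, as the listings earlier in the thread show), a small standard-library check can help. This is a minimal sketch: `parse_version`, `has_fix`, and `installed_datasets_version` are hypothetical helpers written for this example, the 2.8.0 threshold is taken from the reply above, and the parser deliberately ignores suffixes such as `.dev0`.

```python
from importlib import metadata

# Version named in the reply above as containing the fix.
FIXED_IN = (2, 8, 0)


def parse_version(text: str) -> tuple:
    """Turn '2.6.1' into (2, 6, 1), ignoring non-numeric suffixes like '.dev0'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '2.8' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(installed: str) -> bool:
    """Return True if `installed` is at least the version named as fixed."""
    return parse_version(installed) >= FIXED_IN


def installed_datasets_version():
    """Return the installed datasets version string, or None if it is absent."""
    try:
        return metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return None


version = installed_datasets_version()
if version is None:
    print("datasets is not installed in this environment")
else:
    print(f"datasets {version}: at or above 2.8.0 = {has_fix(version)}")
```

Running this with the same interpreter that calls load_dataset shows whether the upgrade actually took effect in that environment.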

theodor1289 · April 2, 2023, 2:02pm

I am facing the same issue with a dataset of over 170 GB:

File "/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/lib64/python3.10/site-packages/datasets/builder.py", line 1495, in _prepare_split
for key, table in logging.tqdm(
File "/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/lib64/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/lib64/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py", line 69, in _generate_tables
for batch_idx, record_batch in enumerate(
File "pyarrow/_parquet.pyx", line 1323, in iter_batches
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs

The code throws this at the following line:

current_dataset = load_dataset(
    dataset_info.key,
    dataset_info.subset,
    split="train",
    use_auth_token=True,
)

I am using datasets 2.11.0, installed with conda install -c huggingface datasets.

crownor · May 12, 2023, 3:57am

I found this issue; maybe each shard is too large: https://github.com/huggingface/datasets/issues/5695

wangskyone · December 12, 2023, 6:07am

I solved the problem using this: https://github.com/huggingface/datasets/issues/2181
