Loading natural_questions

Using just

from datasets import load_dataset

dataset = load_dataset("natural_questions")

gives me the following error

File ~/anaconda3/lib/python3.9/site-packages/datasets/builder.py:1879, in BeamBasedBuilder._download_and_prepare(self, dl_manager, verify_infos, **prepare_splits_kwargs)
1877 if not beam_runner and not beam_options:
1878 usage_example = f"load_dataset('{self.name}', '{self.config.name}', beam_runner='DirectRunner')"
-> 1879 raise MissingBeamOptions(
1880 "Trying to generate a dataset using Apache Beam, yet no Beam Runner "
1881 "or PipelineOptions() has been provided in load_dataset or in the "
1882 "builder arguments. For big datasets it has to run on large-scale data "
1883 "processing tools like Dataflow, Spark, etc. More information about "
1884 "Apache Beam runners at "
1885 "Apache Beam Capability Matrix"
1886 "\nIf you really want to run it locally because you feel like the "
1887 "Dataset is small enough, you can use the local beam runner called "
1888 "DirectRunner (you may run out of memory). \nExample of usage: "
1889 f"\n\t{usage_example}"
1890 )
1892 # Beam type checking assumes transforms multiple outputs are of same type,
1893 # which is not our case. Plus it doesn't handle correctly all types, so we
1894 # are better without it.
1895 pipeline_options = {"pipeline_type_check": False}

MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at Apache Beam Capability Matrix
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage:
load_dataset('natural_questions', 'default', beam_runner='DirectRunner')

Using load_dataset('natural_questions', 'default', beam_runner='DirectRunner') gives this error:

load_dataset('natural_questions', 'default', beam_runner='DirectRunner')
Downloading and preparing dataset natural_questions/default to /.cache/huggingface/datasets/natural_questions/default/0.0.4/da8124c83e3394df62c0f9bbc6c07652bbe9288ad833053134d5f0e978bb4ee5…
Downloading: 100%|██████████| 17.4k/17.4k [00:00<00:00, 1.03MB/s]
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 182.84it/s]

100%|██████████| 1/1 [00:00<00:00, 79.02parquet files/s]
0%| | 0/1 [03:38<?, ?shards/s]
Traceback (most recent call last):
File "", line 1, in
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/load.py", line 1741, in load_dataset
builder_instance.download_and_prepare(
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/builder.py", line 822, in download_and_prepare
self._download_and_prepare(
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/builder.py", line 1920, in _download_and_prepare
num_examples, num_bytes = beam_writer.finalize(metrics.query(m_filter))
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/arrow_writer.py", line 676, in finalize
shard_num_bytes, _ = parquet_to_arrow(source, destination)
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/arrow_writer.py", line 719, in parquet_to_arrow
for record_batch in parquet_file.iter_batches():
File "pyarrow/_parquet.pyx", line 1323, in iter_batches
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs

Hi! What version of datasets are you using? Can you try updating? We fixed a bug like this one in datasets 2.8.0.
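For anyone unsure which version their environment actually picks up, here is a minimal sketch that reads the installed version and compares it with 2.8.0. The helper names (parse_version, is_at_least, installed_version) are made up for this example and are not part of datasets; the comparison only looks at the leading numeric parts of the version string.

```python
import itertools
from importlib import metadata


def parse_version(text):
    """Turn '2.6.1' into (2, 6, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_at_least(installed, required):
    """Return True when the installed version is the required one or newer."""
    return parse_version(installed) >= parse_version(required)


def installed_version(package):
    """Return the version string of an installed package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report what the current interpreter sees for the datasets package.
found = installed_version("datasets")
if found is None:
    print("datasets is not installed in this environment")
elif is_at_least(found, "2.8.0"):
    print(f"datasets {found} is 2.8.0 or newer")
else:
    print(f"datasets {found} is older than 2.8.0, try upgrading")
```

Run it with the same interpreter that calls load_dataset, since conda and pip can each leave a different copy behind.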

Actually, the last error was from a new conda environment that I created specifically to solve this issue. In this environment I had installed only datasets, using conda install datasets.

This is the result again

dataset = load_dataset("natural_questions")
Resolving data files: 100%|██████████| 51/51 [00:00<00:00, 281459.87it/s]
Using custom data configuration natural_questions-4f5f22b23f27c846
Downloading and preparing dataset json/natural_questions to /home/arij/.cache/huggingface/datasets/json/natural_questions-4f5f22b23f27c846/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab…
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 2597.09it/s]
Extracting data files: 100%|██████████| 2/2 [12:37<00:00, 378.77s/it]
Traceback (most recent call last):
File "", line 1, in
File "anaconda3/envs/lib/python3.10/site-packages/datasets/load.py", line 1742, in load_dataset
builder_instance.download_and_prepare(
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 814, in download_and_prepare
self._download_and_prepare(
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 905, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 1520, in _prepare_split
writer.write_table(table)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/arrow_writer.py", line 540, in write_table
pa_table = table_cast(pa_table, self._schema)
File "anaconda3/envslib/python3.10/site-packages/datasets/table.py", line 2068, in table_cast
return cast_table_to_schema(table, schema)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 2030, in cast_table_to_schema
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 2030, in
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1740, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1740, in
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1867, in cast_array_to_feature
casted_values = _c(array.values, feature[0])
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1742, in wrapper
return func(array, *args, **kwargs)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1862, in cast_array_to_feature
arrays = [_c(array.field(name), subfeature) for name, subfeature in feature.items()]
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1862, in
arrays = [_c(array.field(name), subfeature) for name, subfeature in feature.items()]
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1742, in wrapper
return func(array, *args, **kwargs)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1912, in cast_array_to_feature
return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1742, in wrapper
return func(array, *args, **kwargs)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/table.py", line 1821, in array_cast
return array.cast(pa_type)
File "pyarrow/array.pxi", line 915, in pyarrow.lib.Array.cast
File "anaconda3/envs/lib/python3.10/site-packages/pyarrow/compute.py", line 376, in cast
return call_function("cast", [arr], options)
File "pyarrow/_compute.pyx", line 542, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 341, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 3868261029123585616 not in range: -9007199254740992 to 9007199254740992
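A note on the numbers in that message: the bounds -9007199254740992 and 9007199254740992 are exactly -2^53 and 2^53, the range in which every integer can be stored in a 64-bit float without rounding. So it looks like the cast was trying to turn a 64-bit integer column into a float column, which is an assumption read from the bounds and not something the traceback states outright. The sketch below uses plain Python only (no pyarrow) to show why the reported value cannot survive such a conversion; the names FLOAT64_EXACT_LIMIT and fits_float64_exactly are invented for the example.

```python
# 2**53 is the largest magnitude below which every integer is exactly
# representable as an IEEE 754 double (Python's float).
FLOAT64_EXACT_LIMIT = 2 ** 53

# The integer reported in the ArrowInvalid message above.
value = 3868261029123585616


def fits_float64_exactly(n):
    """Return True when n lies inside the range quoted in the error message."""
    return -FLOAT64_EXACT_LIMIT <= n <= FLOAT64_EXACT_LIMIT


# The limit matches the bound printed in the error.
print(FLOAT64_EXACT_LIMIT == 9007199254740992)

# The reported value is far outside that range.
print(fits_float64_exactly(value))

# Converting to float and back does not return the same integer,
# so a lossless int-to-float cast is impossible for this value.
print(int(float(value)) == value)
```

If that reading is right, the failure comes from a mismatch between the type inferred for the column and the values found in it, not from a corrupted download.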

conda install datasets
conda list | grep datasets
datasets 2.6.1 pypi_0 pypi

pip install datasets
pip list | grep datasets
datasets 2.6.1

What I did as a final try was uninstall datasets using both conda and pip, then reinstall it using pip. The latest version of datasets was installed successfully, but the NQ dataset again failed to load, with this error:

File "pyarrow/array.pxi", line 915, in pyarrow.lib.Array.cast
File "/home/arij/anaconda3/envs/QA/lib/python3.10/site-packages/pyarrow/compute.py", line 376, in cast
return call_function("cast", [arr], options)
File "pyarrow/_compute.pyx", line 542, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 341, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 3868261029123585616 not in range: -9007199254740992 to 9007199254740992

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "", line 1, in
File "anaconda3/envs/lib/python3.10/site-packages/datasets/load.py", line 1757, in load_dataset
builder_instance.download_and_prepare(
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 860, in download_and_prepare
self._download_and_prepare(
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 953, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 1706, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "anaconda3/envs/lib/python3.10/site-packages/datasets/builder.py", line 1849, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Anyway, yesterday I downloaded the dataset from the official site. I hope this issue will be fixed soon for future users.

Can you try installing datasets using the official Hugging Face conda channel?

conda install -c huggingface datasets

In your env you seem to have datasets 2.6.1, while this was fixed in 2.8.0.