Loading natural_questions

Arij · January 20, 2023, 5:59pm

Using just

from datasets import load_dataset

dataset = load_dataset("natural_questions")

gives me the following error:

File ~/anaconda3/lib/python3.9/site-packages/datasets/builder.py:1879, in BeamBasedBuilder._download_and_prepare(self, dl_manager, verify_infos, **prepare_splits_kwargs)
1877 if not beam_runner and not beam_options:
1878 usage_example = f"load_dataset('{self.name}', '{self.config.name}', beam_runner='DirectRunner')"
-> 1879 raise MissingBeamOptions(
1880 "Trying to generate a dataset using Apache Beam, yet no Beam Runner "
1881 "or PipelineOptions() has been provided in load_dataset or in the "
1882 "builder arguments. For big datasets it has to run on large-scale data "
1883 "processing tools like Dataflow, Spark, etc. More information about "
1884 "Apache Beam runners at "
1885 "Apache Beam Capability Matrix"
1886 "\nIf you really want to run it locally because you feel like the "
1887 "Dataset is small enough, you can use the local beam runner called "
1888 "DirectRunner (you may run out of memory). \nExample of usage: "
1889 f"\n\t{usage_example}"
1890 )
1892 # Beam type checking assumes transforms multiple outputs are of same type,
1893 # which is not our case. Plus it doesn't handle correctly all types, so we
1894 # are better without it.
1895 pipeline_options = {"pipeline_type_check": False}

MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at Apache Beam Capability Matrix
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage:
load_dataset('natural_questions', 'default', beam_runner='DirectRunner')
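
The two-step flow in this post (a plain call first, then the retry the error message proposes) can be sketched as a small helper. This is only an illustration under stated assumptions: `load_with_beam_fallback` and its `loader` parameter are hypothetical names, not part of `datasets`, and the `beam_runner` argument is taken from the message above, so it may not be accepted by other library versions.

```python
from typing import Any, Callable


def load_with_beam_fallback(
    loader: Callable[..., Any],
    name: str,
    config: str = "default",
    runner: str = "DirectRunner",
) -> Any:
    """Try a plain load first; retry with a local Beam runner if asked to.

    `loader` is a hypothetical stand-in for a function such as
    `datasets.load_dataset`; it is injected so the sketch has no
    third-party imports.
    """
    try:
        return loader(name)
    except Exception as err:
        # Match on the class name so this sketch does not need to import
        # the exception type from the library.
        if type(err).__name__ != "MissingBeamOptions":
            raise
        # Same call shape the error message gives under "Example of usage".
        return loader(name, config, beam_runner=runner)
```

With the real library this would presumably be called as `load_with_beam_fallback(load_dataset, "natural_questions")`. The retry only addresses the missing-runner error; it does nothing about failures that occur later in the preparation step.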

Using load_dataset('natural_questions', 'default', beam_runner='DirectRunner') instead gives this error:

load_dataset('natural_questions', 'default', beam_runner='DirectRunner')
Downloading and preparing dataset natural_questions/default to /.cache/huggingface/datasets/natural_questions/default/0.0.4/da8124c83e3394df62c0f9bbc6c07652bbe9288ad833053134d5f0e978bb4ee5…
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17.4k/17.4k [00:00<00:00, 1.03MB/s]
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 182.84it/s]

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 79.02parquet files/s]
0%| | 0/1 [03:38<?, ?shards/s]
Traceback (most recent call last):
File "", line 1, in
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/load.py", line 1741, in load_dataset
builder_instance.download_and_prepare(
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/builder.py", line 822, in download_and_prepare
self._download_and_prepare(
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/builder.py", line 1920, in _download_and_prepare
num_examples, num_bytes = beam_writer.finalize(metrics.query(m_filter))
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/arrow_writer.py", line 676, in finalize
shard_num_bytes, _ = parquet_to_arrow(source, destination)
File "/anaconda3/envs/lib/python3.9/site-packages/datasets/arrow_writer.py", line 719, in parquet_to_arrow
for record_batch in parquet_file.iter_batches():
File "pyarrow/_parquet.pyx", line 1323, in iter_batches
File “pyarrow/error.pxi”, line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
