Using just:

from datasets import load_dataset
dataset = load_dataset("natural_questions")

gives me the following error:
File ~/anaconda3/lib/python3.9/site-packages/datasets/builder.py:1879, in BeamBasedBuilder._download_and_prepare(self, dl_manager, verify_infos, **prepare_splits_kwargs)
   1877 if not beam_runner and not beam_options:
   1878     usage_example = f"load_dataset('{self.name}', '{self.config.name}', beam_runner='DirectRunner')"
-> 1879     raise MissingBeamOptions(
   1880         "Trying to generate a dataset using Apache Beam, yet no Beam Runner "
   1881         "or PipelineOptions() has been provided in `load_dataset` or in the "
   1882         "builder arguments. For big datasets it has to run on large-scale data "
   1883         "processing tools like Dataflow, Spark, etc. More information about "
   1884         "Apache Beam runners at "
   1885         "https://beam.apache.org/documentation/runners/capability-matrix/"
   1886         "\nIf you really want to run it locally because you feel like the "
   1887         "Dataset is small enough, you can use the local beam runner called "
   1888         "`DirectRunner` (you may run out of memory). \nExample of usage: "
   1889         f"\n\t`{usage_example}`"
   1890     )
   1892 # Beam type checking assumes transforms multiple outputs are of same type,
   1893 # which is not our case. Plus it doesn't handle correctly all types, so we
   1894 # are better without it.
   1895 pipeline_options = {"pipeline_type_check": False}
MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in `load_dataset` or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at Apache Beam Capability Matrix
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called `DirectRunner` (you may run out of memory).
Example of usage:
	load_dataset('natural_questions', 'default', beam_runner='DirectRunner')
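Before retrying, I also looked at the PipelineOptions() route the message mentions. A minimal sketch of what I believe that would look like (assuming apache-beam is installed; --direct_num_workers is just an illustrative DirectRunner flag, not something the error message prescribes):

from apache_beam.options.pipeline_options import PipelineOptions
from datasets import load_dataset

# Pass explicit Beam pipeline options alongside the local runner.
# --direct_num_workers is a DirectRunner option; value chosen arbitrarily here.
beam_options = PipelineOptions(["--direct_num_workers=1"])
dataset = load_dataset(
    "natural_questions",
    "default",
    beam_runner="DirectRunner",
    beam_options=beam_options,
)

I went with the simpler usage example from the error message instead.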
Using load_dataset('natural_questions', 'default', beam_runner='DirectRunner') gives this error:

load_dataset('natural_questions', 'default', beam_runner='DirectRunner')
Downloading and preparing dataset natural_questions/default to /.cache/huggingface/datasets/natural_questions/default/0.0.4/da8124c83e3394df62c0f9bbc6c07652bbe9288ad833053134d5f0e978bb4ee5…
Downloading: 100%|██████████| 17.4k/17.4k [00:00<00:00, 1.03MB/s]
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 182.84it/s]
100%|██████████| 1/1 [00:00<00:00, 79.02parquet files/s]
  0%|          | 0/1 [03:38<?, ?shards/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/lib/python3.9/site-packages/datasets/load.py", line 1741, in load_dataset
    builder_instance.download_and_prepare(
  File "/anaconda3/envs/lib/python3.9/site-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/anaconda3/envs/lib/python3.9/site-packages/datasets/builder.py", line 1920, in _download_and_prepare
    num_examples, num_bytes = beam_writer.finalize(metrics.query(m_filter))
  File "/anaconda3/envs/lib/python3.9/site-packages/datasets/arrow_writer.py", line 676, in finalize
    shard_num_bytes, _ = parquet_to_arrow(source, destination)
  File "/anaconda3/envs/lib/python3.9/site-packages/datasets/arrow_writer.py", line 719, in parquet_to_arrow
    for record_batch in parquet_file.iter_batches():
  File "pyarrow/_parquet.pyx", line 1323, in iter_batches
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
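For what it's worth, the one workaround I am considering trying (an assumption on my part, not something I have confirmed works for this dataset) is to stream the dataset instead of building it locally with Beam, since streaming reads already-processed data from the Hub:

from datasets import load_dataset

# Streaming skips the local Beam preparation step entirely; this assumes
# a processed copy of natural_questions is available on the Hub.
dataset = load_dataset("natural_questions", split="train", streaming=True)
example = next(iter(dataset))

Streaming returns an IterableDataset, so there is no random access, but it would sidestep both the Beam runner requirement and the pyarrow nested-conversion error. Is there a proper fix for the ArrowNotImplementedError itself?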