Last week, the following code was working:
from datasets import load_dataset
dataset = load_dataset('wikipedia', '20220301.en')
This week, it raises the following error:
MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at Apache Beam Capability Matrix
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage:
load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
When I follow the usage example and pass beam_runner='DirectRunner', I get a different error:
Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json
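To narrow down whether the problem is on the Wikimedia side rather than in the library, here is a minimal sketch that checks whether the dump status file is still being served. The helper names dump_status_url and dump_is_reachable are hypothetical (they are not part of datasets), and the URL pattern is only inferred from the error message above, so treat it as an assumption.

```python
import urllib.request


def dump_status_url(lang: str, date: str) -> str:
    """Build the dumpstatus.json URL for a Wikipedia dump.

    Assumption: the pattern is inferred from the URL in the error
    message, not taken from any official documentation.
    """
    return f"https://dumps.wikimedia.org/{lang}wiki/{date}/dumpstatus.json"


def dump_is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True only if a HEAD request to `url` answers with HTTP 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors such as 404 as well as network failures,
        # since urllib's URLError and HTTPError derive from OSError.
        return False
```

Calling dump_is_reachable(dump_status_url("en", "20220301")) returning False would suggest that the 20220301 dump is no longer hosted at that address, which would explain why building the dataset from the raw dump fails.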
Does anyone know how to load the English Wikipedia dataset now?