Error loading Wikipedia Dataset

Last week, the following code was working:
dataset = load_dataset('wikipedia', '20220301.en')

This week, it raises the following error:

MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at Apache Beam Capability Matrix
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage:
load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')

When I try to follow the example of usage, the error is this one:

Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json

Does someone know how to load the English Wikipedia dataset?


Can confirm I’m facing the same error

My temporary solution has been to install a previous version of the datasets package, in my case 2.6.1:

!pip install datasets==2.6.1

Let’s hope it is fixed soon!
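For anyone scripting the workaround, here is a minimal sketch of a version guard. The helper names (`parse_version`, `datasets_is_affected`) are hypothetical, not part of the `datasets` API, and the "affected if newer than 2.6.1" cutoff is an assumption based on this thread:

```python
from importlib.metadata import version, PackageNotFoundError

def parse_version(v):
    """Turn a version string like '2.7.0' into (2, 7, 0) for simple comparison.
    (Simplified: assumes purely numeric components.)"""
    return tuple(int(p) for p in v.split("."))

def datasets_is_affected():
    """Heuristic guess: treat any installed `datasets` newer than 2.6.1 as
    potentially hitting this bug, per the reports in this thread."""
    try:
        installed = parse_version(version("datasets"))
    except PackageNotFoundError:
        return False  # datasets is not installed at all
    return installed > (2, 6, 1)
```

One could call `datasets_is_affected()` at the top of a notebook and print a reminder to `pip install datasets==2.6.1` before attempting the Wikipedia load.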


Yes, I can confirm that reverting to 2.6.1 worked for me as well.

Same error here, stuck for days.

Thanks for reporting. I opened a PR to fix this at Fix loading from HF GCP cache by lhoestq · Pull Request #5321 · huggingface/datasets · GitHub

In the meantime please use datasets==2.6.2


This error is also raised for this dataset when no config name is passed to load_dataset, i.e. load_dataset("wikipedia", None, split="train"). While this is a rookie mistake, I found the resulting error misleading, hence this reply.
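To get a clearer failure than the misleading MissingBeamOptions message, one could validate the config name before calling load_dataset. This is a hypothetical helper sketch, not part of the `datasets` library, and the actual load is commented out to avoid triggering the large download:

```python
def require_config(name):
    """Raise a clear error when the Wikipedia config name is missing,
    instead of letting load_dataset fail with MissingBeamOptions."""
    if not name:
        raise ValueError(
            "The 'wikipedia' dataset needs an explicit config such as "
            "'20220301.en'; passing None leads to a misleading "
            "MissingBeamOptions error."
        )
    return name

# Usage sketch (commented out to avoid downloading Wikipedia here):
# from datasets import load_dataset
# dataset = load_dataset("wikipedia", require_config("20220301.en"), split="train")
```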