Error loading Wikipedia Dataset

Last week, the following code was working:
dataset = load_dataset('wikipedia', '20220301.en')

This week, it raises the following error:

MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at Apache Beam Capability Matrix
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage:
load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')

When I try to follow the example of usage, the error is this one:

Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json

Does someone know how to load the English Wikipedia dataset?


Can confirm I’m facing the same error

My temporary solution has been to install a previous version of the datasets package, in my case 2.6.1:

!pip install datasets==2.6.1

Let’s hope it is fixed soon!
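For anyone scripting the workaround, here is a minimal sketch of a version guard. The helper names (`parse_version`, `datasets_is_affected`) are hypothetical, not part of the `datasets` API, and the "affected if newer than 2.6.1" cutoff is an assumption based on this thread:

```python
from importlib.metadata import version, PackageNotFoundError

def parse_version(v):
    """Turn a version string like '2.7.0' into (2, 7, 0) for simple comparison.
    (Simplified: assumes purely numeric components.)"""
    return tuple(int(p) for p in v.split("."))

def datasets_is_affected():
    """Heuristic guess: treat any installed `datasets` newer than 2.6.1 as
    potentially hitting this bug, per the reports in this thread."""
    try:
        installed = parse_version(version("datasets"))
    except PackageNotFoundError:
        return False  # datasets is not installed at all
    return installed > (2, 6, 1)
```

One could call `datasets_is_affected()` at the top of a notebook and print a reminder to `pip install datasets==2.6.1` before attempting the Wikipedia load.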


Yes, I can confirm that reverting to 2.6.1 worked for me as well.

Same error here, stuck for days.

Thanks for reporting. I opened a PR to fix this at Fix loading from HF GCP cache by lhoestq · Pull Request #5321 · huggingface/datasets · GitHub

In the meantime please use datasets==2.6.2


This error is also raised for this dataset when no config name is passed to load_dataset, i.e. load_dataset("wikipedia", None, split="train"). While this is a rookie mistake, I found the resulting error misleading, hence this reply.
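To get a clearer failure than the misleading MissingBeamOptions message, one could validate the config name before calling load_dataset. This is a hypothetical helper sketch, not part of the `datasets` library, and the actual load is commented out to avoid triggering the large download:

```python
def require_config(name):
    """Raise a clear error when the Wikipedia config name is missing,
    instead of letting load_dataset fail with MissingBeamOptions."""
    if not name:
        raise ValueError(
            "The 'wikipedia' dataset needs an explicit config such as "
            "'20220301.en'; passing None leads to a misleading "
            "MissingBeamOptions error."
        )
    return name

# Usage sketch (commented out to avoid downloading Wikipedia here):
# from datasets import load_dataset
# dataset = load_dataset("wikipedia", require_config("20220301.en"), split="train")
```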