I am using load_dataset to load the new Wikipedia dataset by:
load_dataset("wikipedia", language="en", date="20230301", beam_runner="DirectRunner")
I have successfully downloaded the whole dataset; however, the procedure gets stuck after the download finishes and I have to interrupt it:
Hi, just a hint from my experience: I successfully preprocessed the es dump from 20230320, and that dataset is roughly 4 GB, which is much smaller than the en dataset you chose. The preprocessing took about 3 hours and consumed up to 35 GB of RAM, but it worked. Currently there is one annoyance: there is no indication of progress, except by monitoring the size of the temporary file being created in ~/.cache/huggingface/datasets/wikipedia/en/2.0.0/.... You can see that file growing about every 2 minutes.
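If it helps, a crude way to turn that file growth into a progress proxy is to poll the cache directory size (a sketch; the cache path is the one mentioned above and may differ on your machine):

```python
import glob
import os

def cache_size_bytes(cache_dir):
    # Sum the sizes of the files currently in the cache dir; watching this
    # number grow is the only progress signal I found.
    pattern = os.path.join(cache_dir, "*")
    return sum(os.path.getsize(p) for p in glob.glob(pattern) if os.path.isfile(p))

# Example usage: print the size every 2 minutes (roughly the growth rate I saw).
# import time
# cache_dir = os.path.expanduser("~/.cache/huggingface/datasets/wikipedia/en/2.0.0")
# while True:
#     print(cache_size_bytes(cache_dir), "bytes")
#     time.sleep(120)
```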
So I'm guessing that in your case the processing will take at least 12 hours and consume much more than 35 GB of RAM.
A good alternative to preprocessing locally would be to use DataflowRunner, but I don’t precisely know how to craft the
beam_options arguments to do so.
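For what it's worth, here is a rough guess at what that might look like (untested; I haven't run this on Dataflow, and the GCP project, bucket, and region below are placeholders you would substitute with your own):

```python
from apache_beam.options.pipeline_options import PipelineOptions
from datasets import load_dataset

# All GCP-specific values here are placeholders — replace with your own.
beam_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/beam-temp",
)

load_dataset(
    "wikipedia",
    language="en",
    date="20230301",
    beam_runner="DataflowRunner",
    beam_options=beam_options,
)
```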
My 2 cts!
Cheers and take care